The literature review identified four main categories (each with a number of subcategories) of properties of GAF that were important in relation to its further development: (1) scaling; (2) the anchor points of GAF; (3) scoring within 10-point intervals; and (4) the number of scales.
The presentation of properties in the present study does not require any distinction between the single-scale and dual-scale GAF. When the single scale is used, 'whichever is the worse' of the symptom and functioning values is the single value recorded (according to the manual for DSM-IV-TR).
Scaling
Problems concerning measurement and scaling are fundamental in science and decisive for evaluation of interventions in health care. Scaling means quantifying qualities by assigning numbers [
45]. For psychiatry, scaling has been, and will continue to be, central to its development [
22,
46-
49]. The choice of rating scale matters: problems in scaling can be due to properties of the rating scale [
50,
51].
Continuous or categorical scale
A continuous scale has no steps and does not force the respondent to answer in specific categories [
52]. In GAF, a continuous scale (finely graded with 100 points) has been preferred to a discrete scale. Provided reliability is good, continuous scales can be sensitive in detecting change and differences: statistical testing can show significant differences between samples that differ only slightly in severity of illness. Continuous scales may also be used to define threshold values for assigning diagnoses. It is plausible that symptoms and functioning are more continuous in nature than mental illness itself. Error of measurement for such a finely graded scale may also mask a possible discontinuity of mental disorders. In GAF, the anchor points are ranked, but it is open to question whether the anchor points (with key words and examples) really constitute a natural continuum.
An alternative to a continuous scale is classification into categories with verbally formulated inclusion criteria for each category. The internationally well known symptom checklists are clear examples [
53]. The simplest way of scoring symptom and functioning items is to score present or absent [
24], but scorers can be capable of making more accurate judgements, for example by using a Likert-type scale with five categories, ranging from not present to present to a marked degree [
46,
54]. The items of a symptom checklist must be relevant for the disorder(s) to be studied (that is, a generic scale requires an all-inclusive set of symptoms). If mental disorders can be said to develop in stages, disease-staging systems could be chosen [
55-
57]. The categories are then the stages of the disease-staging system. GAF is not without similarity to categorical scales (that is, the 10 anchor points can be viewed as categories). However, it is not really known whether mental disorders are continuous or discrete in nature [
49,
58-
60].
Gap in knowledge: the development of GAF has little basis in general research on what is best for a global functioning scale (that is, a continuous or categorical scale). Little research has been performed directly on GAF concerning whether a continuous or categorical scale is better.
Visual scale
A VAS (visual analogue scale) is a line with anchor points at each end to indicate the extremes. The scorer marks a point on the scale indicating the severity of the phenomenon. The scored value is the distance from the point to the scale's lower end. The VAS has been used successfully in psychiatry, but there is no conclusive evidence that it is better than categorical scales and it takes more work to analyse [
46,
51,
53,
54,
61,
62]. When a VAS is equipped with descriptive anchor points along the line, it becomes more similar to a scale that could work as a visual scale for GAF. Technologically, it is possible to computerise scoring on a VAS by setting a mark on the screen's digital line, so the computer calculates the distance from the lower end of the line.
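The computerised scoring described above can be sketched as follows. This is a minimal illustration, not an established implementation: the function name `vas_score`, the pixel coordinates, and the default range of 0 to 100 are all assumptions made for the example.

```python
def vas_score(mark_px: float, line_start_px: float, line_end_px: float,
              scale_min: float = 0.0, scale_max: float = 100.0) -> float:
    """Convert a mark on a horizontal on-screen line into a scale value.

    The score is the distance from the lower end of the line to the mark,
    rescaled to the range scale_min..scale_max (assumed here to be 0..100).
    """
    length = line_end_px - line_start_px
    if length <= 0:
        raise ValueError("line_end_px must be greater than line_start_px")
    # Marks placed slightly outside the drawn line are moved onto its ends.
    clamped = min(max(mark_px, line_start_px), line_end_px)
    fraction = (clamped - line_start_px) / length
    return scale_min + fraction * (scale_max - scale_min)


# A mark halfway along a line drawn from pixel 50 to pixel 450.
halfway = vas_score(250, 50, 450)
```

Whether the result should be rounded to whole points, as in the current GAF, is a design choice left open here.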
Gap in knowledge: we do not know whether scoring directly on a visual scale improves scoring for GAF and whether computerisation of such scoring gives better results (for example, improved reliability). If a visual scale is equipped with descriptive anchor points along the line, we do not know which anchor points will be best, how many anchor points should be used, and where along the line the anchor points should be located.
Scales and further treatment of data
Raw data from scaling and measurement often undergo statistical analysis. For such analysis, it is relevant to distinguish between four types of scales: nominal, ordinal, interval and ratio scales. Both nominal and ordinal scales are well known in psychiatry and GAF is an example of an ordinal scale. This has consequences for further treatment of data. We cannot say, for example, that a 5-point change in GAF from 38 to 43 means the same change in severity as that from 68 to 73. Mean GAF at the start of treatment minus mean GAF at the finish, for sample A, cannot be said to be larger than the same change for sample B, in spite of sample A clearly having a larger numerical difference than sample B [
22]. Similarly, it is not entirely correct to add individual scores and divide by the number of individual scores to obtain the mean value. For psychiatry, it is difficult to develop a mental health scale that reaches the level of a real interval or ratio scale, but it is quite common to see GAF data treated as something more than ordinal data. In some research projects, collected raw data for GAF are merged into a limited number of categories [
15,
63]. A simple version of this is to dichotomise the level of functioning into 'superior to fair' and 'poor to grossly impaired' [
64]. Some authors have merged their raw data into more categories (from three to seven [
15,
63,
65-
67]). Such categorisation of a raw data set can be expected to affect the conclusions drawn when the data are treated statistically. For a single-scale GAF, 'whichever is the worse' of an individual's symptom and functioning values is the GAF score [
68]. Also, when scoring is performed on two separate scales (GAF-S and GAF-F scales), sometimes only one score is recorded. In principle, this could be the lower, average or higher of the two scores. As GAF-S and GAF-F score different dimensions, giving just one figure is open to criticism and also means loss of information.
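The two data reductions discussed in this paragraph can be made concrete with a short sketch. It is illustrative only: the function names are hypothetical, and the cutoff used to dichotomise scores is not specified in the text, so it is a required parameter that the caller must supply.

```python
def single_score(gaf_s: int, gaf_f: int, rule: str = "lower") -> float:
    """Reduce separate GAF-S and GAF-F scores to one figure.

    rule="lower" follows the 'whichever is the worse' convention;
    "average" and "higher" are the other options mentioned in the text.
    """
    if rule == "lower":
        return min(gaf_s, gaf_f)
    if rule == "higher":
        return max(gaf_s, gaf_f)
    if rule == "average":
        return (gaf_s + gaf_f) / 2
    raise ValueError(f"unknown rule: {rule!r}")


def dichotomise(score: float, cutoff: float) -> str:
    """Merge a raw score into two categories.

    The cutoff is an assumption supplied by the caller; the text does not
    state where the boundary between the two categories lies.
    """
    return "superior to fair" if score > cutoff else "poor to grossly impaired"
```

With GAF-S = 55 and GAF-F = 62, the three rules give 55, 58.5 and 62, which shows how much the recorded figure depends on the rule and that the difference between the two dimensions is lost in every case.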
Gap in knowledge: when GAF data are treated as something more than ordinal data it is possible that the resulting error is small, but there has been little testing of whether the error is of any practical interest. Similarly, the error resulting from merging raw data into broader categories, and the use of just one score in GAF, have not been subjected to much scrutiny.
The anchor points of GAF
The use of symptoms and functioning as an expression of severity of illness is well known. Furthermore, psychiatric diagnoses express differences in severity, and severity can also include factors such as stage of development of the illness, intensity (for example, frequency and duration of periods with symptoms over a time period), and comorbidity [
69-
72].
The nature of anchor points
The 10 anchor points (with key words and examples of symptoms and functioning items) give a general idea of what to stress in scoring GAF. The use of examples is important and is likely to improve assessment [
73]. In Hall's 'modified GAF' a greater number of criteria for scoring are found [
28]. Items used in different symptom and functioning scoring systems are different; in further work with GAF, ideas for the best subset of items can be drawn from the literature on symptom and functioning scoring [
2,
22,
53,
74,
75].
The anchor points should give descriptions that are sufficiently close to what the clinician observes. Validity may be improved with concrete anchor points [
8]; the anchor points of GAF could be elaborated with more examples. As the anchor points are ranked, symptoms (and also functioning) are treated as unidimensional, but ranking items is especially difficult when the items differ greatly from one another.
Gap in knowledge: in the history of GAF, little change is found in the character of anchor points, key words and examples. We do not know if other anchor points, with other key words and examples, would give a better GAF. We do not know if other expressions of severity (such as stage of development of the illness, intensity, and comorbidity) could be included as scoring criteria. There has been little analysis of whether all the rankings of anchor points are correct. We have little information about potential differences in the validity and reliability for low and high scores.
Symptoms
The current symptom anchor points were generally assigned in earlier stages of development that led to the present GAF, but much symptom research has been performed since then. Symptom checklists can include questions about behavioural and somatic symptoms, and positive and negative feelings of well-being [
22,
76]. Asking about both positive feelings of well-being and somatic symptoms makes the checklist more objective; sensitivity and specificity can be good, and the intent of the measurement is concealed [
22]. As patients can have more than one symptom, with different types and degrees of development, assessments of illness severity based on such symptom clusters seem logical. Many symptoms in psychiatry have two aspects: form (for example, auditory hallucination) and content (for example, the person is told to do something) [
77]. In symptom-scoring systems, symptom content has been largely ignored, but perhaps it should not be [
73].
Gap in knowledge: the considerable body of symptom research has played a limited role in the development of GAF. It is possible that anchor points, key words and examples for anchor points could be improved by learning from symptom research. Symptom clusters, with different degrees of severity for each symptom, have been little evaluated for scoring in GAF. A change in symptom anchor points could have an effect on scoring within 10-point intervals. There has been little evaluation of symptom content as a criterion for scoring illness severity.
Functioning
A large number of indices of functioning have been constructed [
17,
22,
74,
78]. Functional status can be defined as the degree to which an individual is able to perform socially allocated roles free of mentally (or physically) related limitations [
74]. A measure of functioning requires decisions about: which type of functioning should be scored (for appraisal of overall functioning, several types of functioning should be scored, for example difficulties with participation in working life, daily activities, and social relationships); how to grade each type of functioning; and whether an aggregate measure can be made (that is, the total score expressed with one figure).
When functioning is scored in psychiatry, impairments with a somatic background should be excluded [
23,
26], but GAF-F values can be the result of combined mental disorder and somatic disease; some illnesses have a psychosomatic background and somatic diseases can be followed by a psychological reaction. When scoring is carried out for longer time periods, such as 1 year, it can be difficult to attribute functioning values to mental status alone [
17].
When a GAF-F value has been assigned, this should mean that the patient is not able to perform tasks that are higher on the scale, but early support can be associated with improved functioning measured by GAF [
30] (that is, support from healthcare, or family and friends). A patient having problems with functioning at work can achieve a better score by moving to a new job. An advantage with scoring of functioning is that it can be more easily applied across diagnostic groups [
35].
Gap in knowledge: the considerable international research on functioning has played a limited role in the development of GAF. It is possible that anchor points, keywords and examples for anchor points, and scoring within 10-point intervals could be improved by learning from research on functioning. Little analysis has been carried out of different combinations of types, number, and grading of functioning anchor points, and further work is needed to determine the optimal reliability, validity, sensitivity and generic properties of the anchor points.
Positive mental health
In psychiatry, there is a preoccupation with mental illness, but less interest in positive mental health [
70,
79]. Positive and negative feelings are not simply opposite ends of a single-dimension scale [
22]. It could be discussed whether the scoring of GAF should include factors such as life satisfaction, positive quality of life, psychological well-being, and even physical fitness [
70,
71,
74]. Inclusion of questions about 'positive mental health' may be important for prediction of the ability to improve after an episode of mental illness.
Gap in knowledge: a further development of GAF could include a search for indicators of positive mental health. It is possible that inclusion of positive health factors will improve the choice of 10-point interval, and the scoring within 10-point intervals. Different combinations of the types, number and grading of positive health factors have not been analysed to obtain the best possible reliability, validity, sensitivity and generic properties. In addition, there has been little assessment of different combinations of positive and negative feelings in the scoring.
Prognosis
The present GAF has limited value for assessing prognosis [
63], and other systems predict prognosis better [
25,
36,
53]. Prognosis is definable as a part of the severity of illness. A patient who is severely ill with a good prognosis can then be scored more highly than a patient who is less severely ill with a poor prognosis. Prognosis can be related to the patient's resources and not just the patient's problems and is more dependent on diagnosis and symptoms than impairment ratings: the highest level of functioning for a time period is more important for prognosis than the lowest, and substance abuse plays a role [
15,
70,
71,
74].
Gap in knowledge: prognosis has not been much considered as a criterion for scoring in GAF. In the further development of GAF, prognosis may be considered as a criterion for scoring.
Generic properties
In the DSM-IV-TR, there is an overlap between criteria for diagnoses and criteria for GAF scoring. A relationship with diagnoses can be expected for GAF [
15,
26,
32,
34,
63,
80,
81], but DSM is a multiaxial system [
32] where each axis is intended to add information. In their work with GAS, Endicott
et al. [
18] wanted to remove all diagnostic criteria. A different strategy would be to develop different criterion sets for different diagnoses (for example, for dementia and depression). The use of diagnosis-specific symptoms and functioning criteria for GAF scoring could improve the generic properties of GAF.
GAF was intended to be used for both adults and children [
14], but a specific version for children has been developed. The Children's Global Assessment Scale has anchor points that are especially relevant for children [
82].
Gap in knowledge: reviews showing strengths and limitations of GAF's generic properties are difficult to find. Such reviews could form the basis for change in anchor points, for example by adding criteria that are relevant for diagnoses where scoring of GAF is difficult due to lack, or low relevance, of criteria. Reviews of GAF's generic properties could also give information that is important for construction of specialised GAF scales for patient groups that are poorly covered by the present GAF.
Exclusion criteria
The anchor points are generally inclusion criteria for scoring in 10-point intervals. Little work has been performed to identify exclusion criteria for scoring in each interval. An example would be identification of symptoms (or grading of symptoms) that exclude scoring in the GAF-S interval 51-60 and make the interval 41-50 preferable. Proposing that the anchor points of neighbouring 10-point intervals are exclusion criteria may be too simple an answer.
Gap in knowledge: in the history of GAF, little work has been performed to elucidate exclusion criteria for scoring in each interval. A further development of GAF could include a search for specific exclusion criteria.
Extremes of the GAF
The GAF scale identifies the lowest and highest levels for a hierarchy of mental illness. The choice of anchor points at the endpoints determines the range of variation that the scale can express, and endpoints can influence which score is given [
62]. In scoring of morbidity, perfect health often marks one extreme. In GAF-S, the other extreme is persistent danger of severely hurting oneself or others, and in GAF-F it is persistent inability to maintain minimal personal hygiene. In a disease-staging system, death was chosen as the lower endpoint for a number of psychiatric conditions [
55]. However, not all health states can be placed upon a continuum bounded by the anchor points 'perfect health' and 'death' [
62]. Patients themselves can consider some conditions worse than death [
52,
62]. In the Kennedy Axis V's subscale for psychological impairment, criteria have been added to the GAF criteria, such as 'totally insensitive to the feelings and need of others' (the lowest interval) [
83]. The first step in work with a scaling instrument should be to define its endpoints.
Gap in knowledge: we know little about the influence on GAF scores of using other anchor points at the endpoints of the scale.
Number of anchor points
The 100 scoring possibilities in GAF and the low detail of verbal instructions are in conflict with each other. Equipping GAF with a higher number of anchor points could be considered [
10]. In general, the middle range is frequently used in psychiatry, and more elaborate verbal instructions for the middle range could be considered [
82]. For newly admitted inpatients, higher scores are rarely used, which makes it relevant to have more anchor points for the lower range [
18]. In community studies, the upper part of the scale is most relevant, and so the question of having more anchor points for the upper range also comes up. When scoring of GAF is computerised, links can be visible on the screen and clicking on these links gives more detailed information (for example, for scoring newly admitted inpatients and for community studies).
Gap in knowledge: systematic testing of different changes in the number of anchor points (and their distribution over the total scale) to obtain a better GAF is difficult to find in the history of GAF.
Scoring within 10-point intervals
Endicott
et al. [
18] and the manual for DSM-IV-TR give instructions for scoring within 10-point intervals, but instructions are limited. In practice, clinicians tend to score around the decile, or mid-decile, divisions of the scale [
16]. When information for a more accurate score is lacking, intermediate scores in the deciles are chosen [
21,
51].
For improved scoring within the 10-point intervals of current GAF, three tools can be considered: more detailed verbal instructions, development of categorical scales for scoring within the 10-point intervals, and use of the number of criteria met to decide a score within a 10-point interval.
More detailed verbal instructions
More detailed verbal instructions could be developed with the intention of improving scoring within 10-point intervals, that is, more anchor points (more keywords and examples) specified to improve scoring within 10-point intervals.
Development of categorical scales
Categorical scales could be developed to improve scoring within 10-point intervals. This means grading of anchor points (with key words and examples of symptoms and functioning items). Categorical scales often have five categories, such as 'very marked', 'marked', 'neither marked nor weak', 'weak' and 'very weak'. Although functioning scored by a 5-point scale can have good reliability [
84], the optimum number of categories may be five to seven, or more [
24,
46,
50,
51,
54].
Number of criteria met
An alternative procedure for scoring within 10-point intervals is found in the 'modified GAF' [
28]. The number of criteria met is used, for example for the interval 41-50: when one criterion is met the score should be 48-50 and when two criteria are met the score should be 44-47.
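The rule quoted for the interval 41-50 can be written as a simple lookup. The sketch below encodes only the two cases stated in the text; the names are hypothetical, and what the 'modified GAF' prescribes for other counts or other intervals is deliberately not guessed at.

```python
# Score ranges for the interval 41-50, keyed by the number of criteria met.
# Only the two cases quoted in the text are included.
CRITERIA_TO_RANGE_41_50 = {
    1: (48, 50),
    2: (44, 47),
}


def score_range_41_50(criteria_met: int) -> tuple:
    """Return the (low, high) score range for the interval 41-50."""
    try:
        return CRITERIA_TO_RANGE_41_50[criteria_met]
    except KeyError:
        raise ValueError(
            f"no rule is given in the text for {criteria_met} criteria met"
        ) from None
```

The rule narrows the choice from ten possible values to three or four, but the final score within that narrower range still rests on the scorer's judgement.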
Gap in knowledge: in the history of GAF, systematic work to improve scoring within 10-point intervals is limited. This also applies to evaluation of categorical scales for the purpose. Such application of categorical scaling would require consideration of the nature and number of categories.
The number of scales
When GAF is scored according to the instructions in the DSM-IV-TR, only one figure is given, but both symptoms and functioning are assessed. However, the recording of only one figure means there is a lack of knowledge about which dimension is represented. Patients can present a complexity that is better described by having two scales (separate GAF-S and GAF-F scales) [
10,
17,
26,
35,
85].
GAF with two scales
Reliability and validity studies for both GAF-S and GAF-F scales exist, but there are relatively few [
2,
8-
10,
15,
26,
30]. In psychiatry, symptoms and functioning are often closely related [
15,
17,
26,
63], but have been proposed to deviate frequently enough to recommend measuring both in outcome studies [
17,
35]. Functioning can improve without a corresponding symptom improvement and vice versa [
35]. GAF-S and GAF-F can be correlated with r = 0.61 [
10]. When GAF-S scores share more variation with other measures of symptoms and GAF-F scores share more variation with other measures of functioning [
10], this suggests that GAF-S and GAF-F represent different aspects of a patient's condition. Few studies have focused on concurrent validity of GAF-S and GAF-F separately, but the association between GAF-F and other types of functioning may be low [
10,
15,
30,
63]. In general, we have little empirical knowledge about the advantage of separate scores for symptoms and functioning, for example, for assessment of treatment need and measurement of outcome [
10]. The clinical significance, when GAF-S and GAF-F are clearly different, has also been little explored.
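To put the reported correlation in perspective, the squared correlation coefficient gives the proportion of variance that two scores share. The sketch below is a plain-Python illustration with hypothetical function names; it makes no claim about how the cited study computed its coefficient.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient for two equally long score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a list has no variation")
    return sxy / sqrt(sxx * syy)


def shared_variance(r: float) -> float:
    """Proportion of variance shared by two measures with correlation r."""
    return r * r


# For r = 0.61 the shared variance is 0.61 squared, roughly 37%.
shared = shared_variance(0.61)
```

A shared variance of roughly 37% leaves most of the variation in one scale unexplained by the other, which is consistent with the suggestion that GAF-S and GAF-F represent different aspects of a patient's condition. Note that the Pearson coefficient treats the scores as interval data, which goes beyond the ordinal level discussed earlier.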
Gap in knowledge: we know little about the advantage of using GAF with symptom and functioning scales separately. The symptom and functioning scales of GAF should score different dimensions, but the scores should still be correlated. The search for the right combination of definitions of GAF-S and GAF-F has been limited. More studies of reliability and validity should be performed for the GAF-S and GAF-F scales individually.
GAF with more than two scales
In the latest version of the DSM (DSM-IV-TR), two extra scales were provided for further study: the Global Assessment of Relational Functioning Scale (GARF) and the Social and Occupational Functioning Assessment Scale (SOFAS). The Mental Illness Research, Education & Clinical Center (MIRECC) GAF has three scales: for symptom severity, occupational functioning, and social functioning [
8]. In the Kennedy Axis V, the seven subscales provide a broad profile of the patient [
83]. GARF, SOFAS [
5,
26,
29,
86], MIRECC GAF [
8], and Kennedy Axis V [
83] all make more information available to the clinician. If the number of scales is increased, the scoring method may take longer to learn, scoring becomes more time consuming and less easy to perform, and analysis of the results (for example, of outcome) becomes more complex. International diffusion of these scales has been modest.
Gap in knowledge: the advantage of a GAF split into two scales should be investigated more thoroughly before discussing a system with more than two scales. Research on GAF with more than two scales is limited. For example, more study of reliability and validity is necessary, as well as studies of what can be gained and lost by using more than two scales. It seems premature to let such systems replace the current GAF.
Further development of GAF
For work with a new GAF, some overall goals can be formulated: (1) the scale should continue to cover the range from positive mental health to severe psychopathology; (2) it should continue to be a global measure for how patients are doing; (3) the generic properties should be improved; (4) a new GAF should add information compared to the other axes of the DSM-IV-TR; (5) reliability should be improved or at least not reduced; (6) validity should be improved; (7) sensitivity should be analysed, compared to other scaling methods, and found to be good enough for the purpose; (8) the new system should make sense to clinicians; and (9) scoring should be fast and easy. The goals are ambitious, but not necessarily impossible to combine.
Methodology studies of the design of questionnaires demonstrate the significance of variation in instrument properties for scoring results [
50]. The design of scoring instruments for psychiatry shows the same importance of instrument properties for the scoring result [
22,
24,
58,
74]. In the historic development of GAF, little study of systematic variation in system properties has been carried out. The study by Hall [
28] could have been a start (it showed that a change in properties can improve GAF), but it has been little followed up. The significance of the gaps in knowledge is an empirical question that can be investigated. Many alternative forms of a new GAF could be examined (with both major and minor changes). It is difficult to forecast which changes are likely to provide the most significant improvements. Researchers should be aware that even seemingly minor changes can have a major impact [
87]. Reliability and validity are connected [
10]. For example, if validity is improved by a change in the properties of an instrument, reliability may also change (with uncertain direction).
The many application possibilities of GAF have not been widely studied. For GAF to function well in different applications, different changes may be required. Psychometric characteristics are not properties of an instrument
per se, but rather properties of an instrument when used for a specific purpose with a specific sample [
88].
For a new GAF, scoring should be completely computerised. The electronic patient record makes new quality assurance methods possible. For example, some diagnoses are incompatible with high GAF scores. If such a diagnosis has been given, a warning could pop up on the screen if too high a GAF score is given. A correlation is expected between what is scored in a symptom checklist and GAF scoring. A warning could pop up on the screen if this correspondence is lacking.
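The two warnings suggested above could be implemented along the following lines. This is a sketch under stated assumptions: the function name, the diagnosis labels, the score ceilings, and the expected range derived from a symptom checklist are all hypothetical and would have to be established empirically.

```python
def check_gaf_entry(gaf, diagnosis, max_gaf_by_diagnosis, expected_range=None):
    """Return a list of warning messages for a newly entered GAF score.

    max_gaf_by_diagnosis: hypothetical mapping from a diagnosis label to the
        highest GAF score considered compatible with that diagnosis.
    expected_range: optional (low, high) range of GAF scores that the caller
        has derived from a symptom checklist; how to derive it is not
        specified in the text.
    """
    warnings = []
    ceiling = max_gaf_by_diagnosis.get(diagnosis)
    if ceiling is not None and gaf > ceiling:
        warnings.append(
            f"GAF {gaf} is higher than expected for {diagnosis} (ceiling {ceiling})."
        )
    if expected_range is not None:
        low, high = expected_range
        if not low <= gaf <= high:
            warnings.append(
                f"GAF {gaf} is outside the range {low}-{high} suggested by the checklist."
            )
    return warnings


# Hypothetical example: a placeholder diagnosis with an invented ceiling of 50.
messages = check_gaf_entry(85, "EXAMPLE_DIAGNOSIS", {"EXAMPLE_DIAGNOSIS": 50})
```

The function returns messages rather than blocking the entry, so the clinician keeps the final say over the score.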
Construction of health scales requires much work. A new GAF should be subjected to rigorous testing of validity and reliability. Work with a scoring instrument is not complete until it has been tested in a pilot study [
52].