Considerable interest exists in the concept that physicians can assess each other across a range of qualities (for example, integrity, compassion, respect, and responsibility), but this review shows that the instruments designed for peer assessment have not been developed in accordance with best practice. The principles of instrument design involve giving attention to theoretical frameworks and construct clarification in order to establish validity as the basis for reliability studies. These steps are not described for the instruments we identified.
As far as we are aware, this is the first review of instruments for peer appraisal of practising physicians. We followed standard methods, relying on key author contacts and secondary references. The emerging nature of the field limits the effectiveness of database searches.
Caution is needed when developing quantitative measures that use peers to rate complex humanistic qualities, and the complex nature of this field should be acknowledged.25
A common theme in the assessment literature is the question of self evaluation versus external evaluation and whether “others” can form judgments on differing facets of professional practice.26
“Social comparison theory” acknowledges the drive to self evaluate, using similar others as a benchmark,27-29 and recognises the construct of “managerial self awareness” as a process of self reflection using feedback,30 allowing us to “see as others see.”26
The validity of “others” as appropriate rater groups remains a challenge for research, because criteria and frames of reference, even if defined explicitly, will vary with each individual.6,11,18
In other words, how many “true” peers do professionals have? How many peer colleagues are in a position to have accurate enough knowledge of an individual's compassion, responsibility, or respect to make informed judgments?
The wider literature also draws a distinction between “task performance” and “contextual performance.”31
This dichotomy seems to parallel the distinction between performance ratings (as task) and 360 degree or multisource feedback (as context); thus one author suggests that multisource feedback and performance ratings may be separate constructs.32
The point here is that the instruments developed for peer rating of physicians have not explicitly allowed for this distinction between “task” and “contextual” performance or for its effects on ratings.
The other key issue is the perceived fairness of the peer appraisal process. Procedural justice theory suggests that people naturally make judgments on how decisions are arrived at (procedural justice) quite separately from judgments on outcomes of decisions (distributive justice).33
Procedural justice is seen as more important than outcome in terms of overall acceptability and as an essential element of validity. This initial judgment on fairness also sets a frame of reference for interpreting subsequent events that has a crucial and enduring influence.33
Doubts about the face validity of arriving at a judgment on a peer's compassion or integrity risk jeopardising the peer appraisal process through negative perceptions, which could be difficult to overcome subsequently. Evidence seems to exist that an appraisal process, once under way, enters a feedback loop that quickly becomes either positive or negative, with no safe middle ground.25
The identified instruments would need to consider procedural justice by demonstrating their validity through clearly defined criteria and constructs relevant to the rater groups.
Face validity—indicates whether an instrument “seems” to either the users or designers to be assessing the correct qualities. It is essentially a subjective judgment.
Content validity—a judgment by one or more “experts” as to whether the instrument samples the relevant or important “content” or “domains” within the concept to be measured. An explicit statement by an expert panel should be a minimum requirement for any instrument. However, to ensure that the instrument is measuring what is intended, methods that go beyond peer judgments are usually needed.
Criterion validity—usually defined as the correlation of a scale with some other measures of the trait or disorder under study (ideally a “gold standard” in the field).
Construct validity—refers to the ability of the instrument to measure the “hypothetical construct” at the heart of what is being measured. Where a gold standard does not exist (as is the case for measuring humanistic qualities such as compassion, integrity, responsibility, and respect), construct validity is determined by designing experiments that explore the ability of the instrument to “measure” the construct in question. This is often done by applying the scale to different populations that are known to differ in the property being assessed. A series of such converging studies can then establish the construct validity of the new instrument.
Internal consistency—assumes that the instrument is assessing one dimension or concept and that scores on individual items will correlate with scores on all other items. These correlations are usually calculated by comparing items with one another.
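For instance, the most widely reported index of internal consistency, Cronbach's alpha, summarises these inter-item relations for an instrument of k items:

\[
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^{2}_{i}}{\sigma^{2}_{X}}\right)
\]

where \(\sigma^{2}_{i}\) is the variance of scores on item i and \(\sigma^{2}_{X}\) is the variance of the total score; values of about 0.7-0.8 or above are conventionally taken to indicate acceptable internal consistency.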
Stability—an assessment of the ability of the instrument to produce similar results when used by different observers (inter-rater reliability) or by the same observer on different occasions (intra-rater reliability). Test-retest reliability assesses whether the instrument produces the same result if used on the same sample on two separate occasions.
Where measurements involve multiple raters in complex interactions, the use of generalisability theory to produce reliability coefficients is advocated.
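As an illustration, in the simplest fully crossed design, where every physician is rated by the same n raters, the generalisability coefficient for relative decisions can be written as:

\[
E\rho^{2} = \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \sigma^{2}_{pr,e}/n}
\]

where \(\sigma^{2}_{p}\) is the variance attributable to true differences between the physicians being rated and \(\sigma^{2}_{pr,e}\) is the residual variance from the physician by rater interaction and error. The coefficient rises as more raters contribute, so a subsequent decision study can estimate how many raters would be needed to reach a target level of reliability.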
What is already known on this topic
The range of professional competences and qualities now recognised as necessary in a good physician is not adequately assessed by conventional examinations and assessments
Suitable methods are needed to assess the broader range of competences, including “humanistic” qualities and professionalism
Peers are one potential source of assessment of these aspects of physicians' practice
What this study adds
Very few instruments designed for peer assessment of physicians exist, and their development so far has focused on reliability and feasibility
The available instruments lack theoretical frameworks, and their validity remains questionable
Clarity of purpose is a key determinant of the subsequent “success” of peer appraisal but may be lost by confounding summative and formative aims
Concern has been voiced about the validity of peer evaluation. If the validity of peer ratings remains unclear, then, as Saturno reminds us, reliability and feasibility are no substitute.3,8,15,16
A possible approach has been initiated in Finland, where qualitative methods have been used to begin to characterise some of the concepts and constructs relevant to peer appraisal that are needed before quantitative tools are developed.34
When peers have attempted to rate humanistic qualities, the validity has not been well supported by empirical findings; several studies show poor agreement between observers of the same events.5-8,35 An argument is emerging that the most valid source of ratings for humanistic dimensions is patients,5,6,10,30 because only they have experienced certain qualities, such as “a level of intimacy,” not available to other raters such as peers.30
Implications for policy
Quality “improvement” using formative developmental appraisal and quality “assurance” using methods to identify underperformance are separate aims. The importance of being clear about the purpose has been emphasised repeatedly.25,31
Combining these separate aims may compromise such clarity. This problem seems to have confounded the development of peer assessment methods. Peers may in effect be asked to make two judgments at once, one on “quality” and one on “adequacy for purpose.” Making a judgment on adequacy presupposes knowledge about acceptable ranges for the criteria, which must be defined.35
The validity of rating items in assessing aspects such as the compassion, integrity, respect, or responsibility of a peer remains highly suspect. To have any validity or reliability, such qualities would need to be expressed as observable behaviours. In the absence of clearly defined constructs derived from a bottom-up empirical approach, and lacking a coherent theoretical framework, what is being measured here, if anything, is unclear.
Implications for research
Interest exists in using peers to assess the humanistic qualities of physicians, but the theoretical underpinning is lacking. Clarity of purpose is vital, and more attention needs to be given to the underlying constructs of interest. It must be recognised that such judgments can be made only by those who experience the qualities in question. In the meantime, peer assessment methods should be used with caution.