Our study is the first systematic review on this topic. It provides evidence that the Bland-Altman method (limits of agreement) is the most popular method used to measure agreement. The majority (85%) of agreement studies in this review applied the Bland-Altman method to assess agreement, with more than half (56%) of them using only the Bland-Altman method (i.e. without combining it with any other method). Our study also shows that statistical methods are still being applied inappropriately to assess agreement in the medical literature.
Bland and Altman introduced the limits of agreement to quantify agreement in 1983 [2]. The limits of agreement are given by: limits of agreement = mean difference ± 1.96 × (standard deviation of differences). The limits of agreement depend on the assumptions that the mean and standard deviation of the differences are constant throughout the range of measurement, and that the distribution of these differences approximately follows a normal distribution
[2]. Bland and Altman proposed a scatter plot of the difference between the two measurements against their average, together with a histogram of the differences, to check these assumptions
[2]. The scatter plot was initially used only to check these assumptions and not for the analysis of agreement, but it has since become a graphical presentation of agreement. This plot is actually similar to the Tukey mean-difference plot
[26], which is popularly used in non-medical fields. This plot was popularized by Altman and Bland in medical statistics, and is now referred to as the Bland-Altman plot
[26]. Despite the popularity of the Bland-Altman method, Hopkins
[27] demonstrated that the Bland-Altman plot tends to incorrectly indicate the presence of systematic bias in the relationship between two measures. When a regression line is fitted to the Bland-Altman plot, proportional bias is taken to exist if the slope of the line differs significantly from zero [28]. However, Ludbrook [28] claimed that this apparent bias is a consequence of the statistical assumptions underlying the analysis. Fitting the regression line in the Bland-Altman plot by least-products regression has been claimed to eliminate the bias problem
[28].
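The limits of agreement described above can be sketched in a few lines of Python. This is a minimal illustration, not code from any of the reviewed studies: the function name `limits_of_agreement` and the paired readings are hypothetical, and the calculation assumes the differences have a constant mean and standard deviation and are approximately normally distributed.

```python
from statistics import mean, stdev


def limits_of_agreement(method_a, method_b):
    """Return (mean difference, lower limit, upper limit) for paired readings."""
    if len(method_a) != len(method_b):
        raise ValueError("paired readings must have the same length")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)   # mean difference between the two methods
    sd = stdev(diffs)    # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical paired readings, invented for illustration only.
reference = [5.1, 6.3, 7.8, 9.0, 10.4, 12.1]
new_device = [5.4, 6.1, 8.0, 9.5, 10.2, 12.6]

bias, lower, upper = limits_of_agreement(new_device, reference)
print(f"mean difference = {bias:.2f}")
print(f"limits of agreement = ({lower:.2f}, {upper:.2f})")
```

Whether limits of this width are acceptable is a clinical judgement that the calculation itself cannot make.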
In this review, the correlation coefficient was also found to be used to measure agreement. The correlation coefficient (r) reflects the noise and direction of a linear relationship
[4]. Perfect correlation occurs if all the points lie along a straight line. If we compare two instruments (A and B), with variable Y as the reading from instrument A and X as the reading from instrument B, it is possible to have perfect correlation (r = 1) for both Y = X and Y = 2X. In terms of agreement, however, there is agreement in the first case (Y = X) but not in the second (Y = 2X), where the value of Y is always twice the value of X. Table 4 demonstrates this. The correlation coefficient r for the relationship between variables A and B is 0.9798. Although the variable C is twice the value of B, the correlation coefficient of A and C is exactly the same (r = 0.9798). There is clearly no agreement between A and C, yet the correlation coefficient is still very high, suggesting a strong correlation or association. Therefore, the correlation coefficient does not represent agreement.
Table 4. Data sets to demonstrate the inappropriate use of the correlation coefficient in testing agreement.
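The same point can be checked numerically. The sketch below uses hypothetical readings (they are not the values from Table 4) and a hand-written helper, `pearson_r`, to show that doubling one set of readings leaves the correlation coefficient unchanged while the mean difference between the readings becomes much larger.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Product-moment correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical readings, invented for illustration only.
a = [10, 12, 15, 18, 20, 23]
b = [11, 12, 14, 19, 21, 22]
c = [2 * value for value in b]   # C is always twice B

r_ab = pearson_r(a, b)
r_ac = pearson_r(a, c)
mean_diff_ab = mean([x - y for x, y in zip(a, b)])
mean_diff_ac = mean([x - y for x, y in zip(a, c)])

print(f"r(A, B) = {r_ab:.4f}, mean difference A - B = {mean_diff_ab:.2f}")
print(f"r(A, C) = {r_ac:.4f}, mean difference A - C = {mean_diff_ac:.2f}")
```

The two correlation coefficients coincide because r is unaffected by rescaling either variable, which is exactly why it cannot detect the disagreement between A and C.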
Some researchers proceed to regression analysis as an extension of correlation analysis to answer their question of agreement, using the coefficient of determination (r²) as a measure of agreement. Again, this is inappropriate: the coefficient of determination, being derived from the correlation coefficient, relies on a similar concept and is thus not suitable for assessing agreement. The coefficient of determination states the proportion of variance in the dependent variable that is explained by the regression equation or model [4]. The more closely the points in the scatter diagram are clustered around the regression line, the higher the proportion of variation explained by the regression line, and thus the greater the value of r² [4].
The third most popular method found in this review is the comparison of the means of readings from two instruments. The paired t-test is usually used to test for a significant difference between the means of two sets of data as a way of assessing agreement [2]. A non-significant result has been interpreted as showing no difference, and hence agreement, between the two sets of readings, and vice versa. However, a non-significant paired t-test does not indicate agreement. The mean is affected by every value in the data, so it can be unduly influenced by extremely large or extremely small values, and large positive and negative differences can cancel each other out. Poor agreement between the two instruments can therefore be hidden in the distribution of the differences, and the two methods can appear to agree. In an agreement study, we are interested not in the mean of the readings from each instrument but in each individual reading: what matters is that each reading from the standard instrument is reproduced by the new instrument. Furthermore, statistical significance depends on the power of the study.
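A small numerical sketch illustrates how a non-significant paired t-test can coexist with poor agreement. The readings are hypothetical and were chosen so that large positive and negative differences cancel; the t statistic is computed by hand, and the critical value 2.365 is the two-sided 5% point of the t distribution with 7 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical readings, invented for illustration only.
reference = [50, 60, 70, 80, 90, 100, 110, 120]
errors = [15, -15, 14, -14, 16, -16, 15, -15]   # large, but they cancel out
new_device = [ref + err for ref, err in zip(reference, errors)]

diffs = [new - ref for new, ref in zip(new_device, reference)]
n_pairs = len(diffs)
mean_diff = mean(diffs)
sd_diff = stdev(diffs)

# Paired t statistic: mean difference divided by its standard error.
t_stat = mean_diff / (sd_diff / sqrt(n_pairs))
T_CRIT = 2.365   # two-sided 5% critical value, 7 degrees of freedom

lower = mean_diff - 1.96 * sd_diff
upper = mean_diff + 1.96 * sd_diff

print(f"t = {t_stat:.3f}, significant at 5%: {abs(t_stat) > T_CRIT}")
print(f"limits of agreement = ({lower:.1f}, {upper:.1f})")
```

In this example the means of the two sets of readings are identical, so the test finds no difference, yet every individual reading from the new device is 14 to 16 units away from the reference.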
Another method found in this review is the intra-class correlation coefficient (ICC). The ICC was initially devised to assess reliability [29]. However, it was later used to assess agreement, to avoid the problem of a linear relationship being mistaken for agreement in the product-moment correlation coefficient (r) [30], [31]. Different assignments of the measurements to X and Y in the calculation of the correlation coefficient (r) would produce different values of r. To overcome some of the limitations of the correlation coefficient (r), the ICC averages the correlations among all possible orderings of the pairs [32]. In contrast with the correlation coefficient (r), the ICC also extends to more than two observations. A number of different ICC statistics have been proposed, and there has been considerable debate about which ICC statistic is appropriate to assess agreement [30]. The use of the ICC in assessing agreement has also been criticized by Bland and Altman
[31]. The ICC ignores ordering and treats both methods as a random sample from a population of methods
[31]. In an agreement study, two specific methods are compared, not two instruments chosen at random from some population. Another issue with the ICC is that it is influenced by the range of the data: if the variance between subjects is high, the value of the ICC will tend to be high [33]. Although the ICC seems to be popular, its appropriateness for assessing agreement is therefore also doubtful.
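The dependence of the ICC on the range of the data can be illustrated with a short sketch. It assumes the one-way random-effects form of the ICC, which is only one of the several ICC statistics mentioned above; the helper `icc_oneway` and both data sets are hypothetical. The two data sets have identical differences between methods and differ only in how widely the subjects are spread.

```python
from statistics import mean


def icc_oneway(rows):
    """One-way random-effects ICC; each row holds one subject's readings."""
    n = len(rows)        # number of subjects
    k = len(rows[0])     # number of readings per subject
    grand_mean = mean([value for row in rows for value in row])
    row_means = [mean(row) for row in rows]
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (value - m) ** 2 for row, m in zip(rows, row_means) for value in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical data: the same measurement errors applied to two sets of subjects.
errors = [(1, -1), (-1, 1), (1, -1), (-1, 1), (1, -1)]
narrow_truth = [98, 99, 100, 101, 102]   # subjects are very similar
wide_truth = [60, 80, 100, 120, 140]     # subjects differ widely

narrow = [(t + e1, t + e2) for t, (e1, e2) in zip(narrow_truth, errors)]
wide = [(t + e1, t + e2) for t, (e1, e2) in zip(wide_truth, errors)]

icc_narrow = icc_oneway(narrow)
icc_wide = icc_oneway(wide)
print(f"ICC, narrow range = {icc_narrow:.3f}")
print(f"ICC, wide range   = {icc_wide:.3f}")
```

Because the differences between the two methods are the same in both data sets, their limits of agreement would be identical, yet the ICC is far higher for the data set with the wider range.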
Often in testing for agreement, the gradient of the regression line of the two variables is tested against one [2]. The argument is that if the two methods or instruments are equivalent, i.e. if they measure the same variable in the same subject, both instruments will give the same reading, and thus the gradient of the regression line will be one [2]. So if instrument A measures y and instrument B measures x, and if y = x, the gradient of the line is equal to one. It is true that the regression line y = x will always have a gradient of 1. However, the converse is not always true. If the gradient = 1, the regression line could be y = x, or it could be y = a + x. Therefore, solely testing whether the gradient = 1 is also an inappropriate method of testing agreement. When the gradient = 1, some people proceed to test the y-intercept. Theoretically, if the gradient = 1 and the y-intercept = 0, then y will be equal to x (y = x). However, testing both the gradient and the intercept to assess agreement is not as popular as other methods.
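The case y = a + x can be made concrete with a short sketch. The readings are hypothetical, with instrument A always reading 5 units higher than instrument B, and `fit_line` is a hand-written ordinary least-squares fit introduced here for illustration.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares fit of y on x; returns (gradient, intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    gradient = sxy / sxx
    return gradient, my - gradient * mx


# Hypothetical readings: instrument A (y) is always 5 units above instrument B (x).
x = [4, 6, 8, 10, 12]
y = [value + 5 for value in x]

gradient, intercept = fit_line(x, y)
print(f"gradient = {gradient:.2f}, intercept = {intercept:.2f}")
print("differences y - x:", [b - a for a, b in zip(x, y)])
```

The fitted gradient is 1 and the intercept is 5, so a test of the gradient alone would pass even though no pair of readings agrees.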
The proportion of various statistical methods found in this review probably reflects the proportion of medical instruments in current clinical practice that have been validated using those particular statistical methods. Almost all methods have received criticism, including the Bland-Altman method. However, the correlation coefficient, the coefficient of determination, the regression coefficient, and the comparison of means are clearly inappropriate for assessing agreement. Although Altman and Bland have been highlighting the inappropriateness of these statistical methods in method comparison studies since the 1980s, some of these methods were still in use in the studies we reviewed. This study found that 20 (10%) of the reviewed articles used only these inappropriate methods to assess agreement. Equipment tested using these methods may not be valid, and consequently may produce inaccurate readings. It makes uncomfortable reading that as many as one in ten supposedly validated instruments currently used in clinical practice may not be accurate. This has the potential to affect the management of patients and the quality of care given to them, and, worse still, could cost lives.
In 2009, Essack et al.
[34] conducted a study to assess the accuracy and precision of five currently available blood glucose meters in South Africa. The study compared five different types of glucometers and all the glucometers were calibrated
[34]. The authors found that although all the devices showed satisfactory precision, there was substantial discordance when their results were compared to a laboratory reference
[34]. Only three out of the five glucometers fulfilled the criteria suggested by the International Organization for Standardization (ISO) [20]. The variability in the accuracy of glucometers can affect patient care in different settings, including that of the diabetic patient on insulin in home care or in a clinic. Inaccuracies can lead to misclassification of hypoglycaemic or hyperglycaemic episodes.
It is imperative that all medical instruments are accurate and precise; failure in this regard may lead to critical medical errors. There is therefore a need for proper evaluation of all medical instruments, and it is important to be sure that an appropriate statistical method has been used. The inappropriate application of statistical methods in the analysis of agreement is a cause for concern in the medical field and cannot be ignored. It is important for medical researchers and clinicians from all specialties to be aware of this issue, because inappropriate statistical analyses will lead to inappropriate conclusions, thus jeopardizing the quality of the evidence, which may in turn influence the quality of care given to the patient.
Of the 210 reviewed articles, only six studies were co-authored by someone working in a statistics or biostatistics department. The other studies did not state whether any assistance was sought from a statistician. One of the six studies used the correlation coefficient and comparison of means to study agreement, whereas the other five used the Bland-Altman method (either alone or in combination with another method). Medical researchers might need to consider seeking assistance from a statistician in analyzing data from agreement studies. This could potentially reduce errors in data analysis, avoid the use of inappropriate methods and improve the interpretation of results.
Recently, the guidelines for reporting reliability and agreement studies (GRRAS) have been proposed
[35]. Kottner et al.
[35] found that the reporting of method comparison studies (both agreement and reliability studies) was incomplete and inadequate. Information about sample selection, study design and statistical analysis was often incomplete
[35]. We also found that even a recent article [36], published early this year (2012), relied on inappropriate analysis to test for agreement. The authors of this article [36] used r-squared (r²), also known as the coefficient of determination, to assess the accuracy (agreement) of glucose analyzers. In one of their results, the authors described the Nova StatStrip device as showing excellent performance that almost agreed and correlated perfectly with the laboratory results because r² = 0.99 [36]. This suggests that there is a need for a recommendation or guideline on how to perform analysis in agreement studies.
This systematic review has several strengths. This is the first study specifically designed to retrieve information on statistical methods used to test for agreement of instruments measuring the same continuous variable in the medical literature. This study also provides supporting evidence that confirms the anecdotal claim that the Bland-Altman method is the most popular method used to assess agreement. A broad search term was used, in order to capture the largest possible number of publications on this topic. We also tried to reduce bias by using two independent reviewers during the selection of articles and data extraction. However, the results of this study may have limited generalizability due to selection bias. This review was limited to five electronic databases (Medline, Ovid, PubMed, Science Direct and Scopus) and limited to articles published only in English. The search was only performed using online databases, and as such, unpublished articles were not considered. However, these databases have a very wide coverage of published medical journals including high quality and high impact journals.
In conclusion, various statistical methods have been used to measure agreement in validation studies. This study concludes that the Bland-Altman method is the most popular method that has been used to assess agreement between medical instruments measuring continuous variables. There were also some inappropriate applications of statistical methods to assess agreement found in recent medical literature. It is important for the clinician and medical researcher to be aware of this issue because erroneous and misleading conclusions from inappropriate statistical analyses may lead to the application of inaccurate instruments in clinical practice. The issue of inappropriate analyses in agreement studies needs to be highlighted to prevent repetition of the same mistake by future researchers.