In quantifying medical images, length-based measurements are still obtained manually. Due to possible human error, a measurement protocol is required to guarantee the consistency of measurements. In this paper, we review various statistical techniques that can be used in determining measurement consistency. The focus is on detecting a possible measurement bias and determining the robustness of the procedures to outliers.
We review correlation analysis, linear regression, the Bland-Altman method, the paired t-test, and analysis of variance (ANOVA). These techniques were applied to measurements, obtained by two raters, of head and neck structures from magnetic resonance images (MRI).
Correlation analysis and linear regression were shown to be insufficient for detecting measurement inconsistency. They are also very sensitive to outliers. The widely used Bland-Altman method is a visualization technique, so it lacks numerical quantification. The paired t-test tends to be sensitive to small measurement bias. On the other hand, ANOVA performs well even under small measurement bias.
In almost all cases, using only one method is insufficient and it is recommended to use several methods simultaneously. In general, ANOVA performs the best.
This paper is motivated in part by the need to establish a reliable protocol for measuring head and neck structures, both bony and soft tissue, from magnetic resonance images (MRI) collected for the purpose of quantifying the growth pattern of various oral and pharyngeal structures or vocal tract structures (Vorperian et al, 2005; Vorperian et al, 2007). Figure 1 depicts a select set of such measurements obtained manually from MRI.
It is crucial to obtain accurate and reliable measurements, particularly in developmental studies, and to establish an accurate measurement protocol. Unfortunately, since the ground truth for manual measurements is never known, it is difficult to quantitatively determine whether a given protocol produces consistent measurements. We have addressed this problem by placing reference landmarks and obtaining repeated measures from MRIs by two trained raters. Using those paired measurements, we then assessed the consistency of the measurements produced by our protocol. The purpose of this study is to determine the ideal analysis method for checking the consistency of measurements. We will refer to this problem as the measurement consistency problem.
The measurement consistency problem occurs universally and is of broad interest to researchers in diverse medical imaging disciplines. Several major statistical approaches have been used to check measurement consistency. The most widely used methods are correlation analysis, linear regression, the paired t-test, and the Bland-Altman method (Krummenauer and Doll, 2000; Bland and Altman, 1986). A review of the measurement consistency problem can be found in Krummenauer and Doll (2000), who conclude that using only one method is insufficient and that several methods should be applied and compared. They also suggest making as many repeated measurements as time and cost permit for a more accurate determination of measurement consistency.
Bland and Altman (1986) found that correlation analysis, which is a popular method for establishing measurement consistency (Edvardsen et al, 2002; Liu et al, 2006; Van Oosterhout et al, 1995; Powell et al, 2000; Vallejo et al, 2000), is not appropriate. They proposed a visualization technique, called the Bland-Altman method, based on the difference between measurements. A detailed discussion of the Bland-Altman method can be found in Bland and Altman (1995) and Bland and Altman (2003). Braždžionytė and Macas (2007) claimed that the Bland-Altman method is more appropriate for assessing measurement consistency than correlation analysis and linear regression. However, the shortcoming of the Bland-Altman approach is that it is a visualization technique and lacks numerical quantification.
Abate et al (1994) used the Bland-Altman method to analyze the measurement consistency between MRI and dissection for measuring adipose tissue mass. Powell et al (2000) used both a linear regression and the Bland-Altman method to analyze the measurement consistency between ultrasonic flowmeter measurements and phase-velocity cine MRI. Edvardsen et al (2002) used a paired t-test and the Bland-Altman method to compare the measurements from Tissue Doppler echocardiography with the measurements from MRI. Liu et al (2006) used the correlation coefficient to analyze the measurement consistency between manual delineation and automated segmentation of thermal coagulation on 3-D elastographic image.
In this paper, we review various quantitative techniques for determining measurement consistency, and provide an MRI study that describes the strengths and weaknesses of each technique. When comparing techniques, our main focus is on detecting measurement bias and determining robustness to outliers. We also provide guidelines for using each technique.
MRIs from 10 male subjects between 0 and 4 years of age were used for this study. The landmarks for making measurements were placed on the MRI slice independently by two trained raters, referred to as CC and RD. All landmarks and measurements were taken from the midsagittal slice of the MRI images from the imaging database. To ensure unbiased placement of landmarks, RD and CC each placed landmarks on the image after suppressing the landmarks placed by the other. Thus each rater landmarked and measured the selected image independently of the other. All landmarks and measurements were made using Sigma Scan Pro version 5 (Systat), and data were recorded onto a hard copy measurement sheet and entered into a measurement database for statistical analysis. All measurements were made in centimeters.
Both CC and RD obtained measurements from ten MRIs independently at three separate times, resulting in a total of 60 measurements. These measurements were classified into four different categories: consistent, less consistent, biased, and with outliers. Of the 38 variables measured in the head and neck region, the following 6 variables are used to illustrate each case: Head length (HL), lower anterior facial height (LFH), anterior tongue length (ATL), Hyoid vertical distance from PNS (HVP), vocal tract length (VTL) and soft palate (SP). The definitions of those six variables are as follows (Figure 1).
Head length (HL): The maximum linear distance from the glabella to the opisthocranion.
Lower anterior facial height (LFH): The distance from the stomion to the gnathion. If the subject has an open mouth posture, the stomion was taken as the point at the antero-superior edge of the mandibular lip.
Anterior tongue length (ATL): The curvilinear distance along the dorsal superior contour of the tongue from the tongue tip to the intersection with the line dividing the hard palate and soft palate.
Vocal tract length (VTL): The curvilinear distance along the midline of the tract (i.e. the distance along the midpoints of lines drawn between the inferior and superior boundaries of the vocal tract wall) starting at the level of the true vocal fold to the intersection with a line drawn tangentially to the lips.
Hyoid vertical distance from PNS (HVP): The vertical distance from the inferior and anterior aspect of the hyoid bone to the level of the PNS.
Soft palate (SP): The curvilinear distance from the posterior edge of the hard palate to the inferior edge of the uvula, a projection of variable length from the free inferior border of the soft palate. The criterion used to identify the end of the hard palate and the beginning of the soft palate is a line drawn at the beginning of the hard palate/soft palate overlap.
The measurement errors themselves are relatively small and measured by the average relative error defined as
where $RD_i$ and $CC_i$ are the i-th measurements of RD and CC respectively, and n = 30 is the number of measurements obtained by each rater. The average relative errors (ARE) for HL, LFH, ATL, HVP, VTL and SP are 0.016, 0.036, 0.041, 0.070, 0.046 and 0.1 respectively. The fairly large ARE of SP is caused by an outlier (Figure 2).
Figure 2 shows the scatter plots of the measurements of each head and neck structure. There are 30 data points on each scatter plot (three repeated measurements for 10 MRIs). The solid line (y = x) indicates perfect consistency between the two raters. The two raters measured HL and LFH consistently, and most points lie near the y = x line. ATL and HVP measurements are less consistent than LFH. For VTL, most points are under the y = x line, so the measurements obtained by RD are biased relative to the measurements obtained by CC. For SP, there is an outlier caused by RD.
The correlation coefficient r measures the linear relationship between two variables, and ranges between −1 and 1. If measurements are consistent, we expect a strong linear relationship and, in turn, a correlation value close to 1. On the other hand, if the measurements are less consistent, a correlation value close to 0 is expected. Under the null hypothesis of zero correlation (not consistent), the significance of the correlation can be tested using a t-statistic with n − 2 degrees of freedom:
$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}} $$
Correlation analysis has previously been used to assess measurement consistency (Edvardsen et al, 2002; Liu et al, 2006; Van Oosterhout et al, 1995; Powell et al, 2000; Vallejo et al, 2000). However, as we will show in the results, it is not a proper procedure for this purpose.
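As an illustration of the correlation test described above, the following minimal sketch computes the sample correlation coefficient and the t-statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$. The function names and the paired values are hypothetical and are not the study data.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_t(r, n):
    """t-statistic for the null hypothesis of zero correlation (n - 2 df)."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

# Hypothetical paired measurements in cm (illustrative only).
cc = [10.2, 11.5, 12.1, 13.4, 14.0, 15.3]
rd = [10.0, 11.9, 12.0, 13.1, 14.4, 15.0]

r = pearson_r(cc, rd)
t = correlation_t(r, len(cc))
print(f"r = {r:.3f}, t = {t:.2f} on {len(cc) - 2} degrees of freedom")
```

The resulting t value would be compared against a t-distribution with n − 2 degrees of freedom to obtain a p-value; that lookup is omitted here because it requires a statistics library.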
Alternatively, a linear regression can be used to determine the measurement consistency (Braždžionytė and Macas, 2007; Powell et al, 2000). The following regression model is used to fit measurements:
When RD and CC are consistent, we expect the regression slope β1 to be close to one. By testing whether the slope is equal to one, we can quantitatively determine the consistency. The regression fit is given in Figure 2. Since the slope is proportional to the correlation coefficient, the correlation analysis and the linear regression are equivalent approaches, although this equivalence has not been exploited previously (Chatterjee et al, 2000). Similarly, one can test whether the intercept β0 is close to zero to check for bias, that is, whether one rater is systematically obtaining larger or smaller measurements than the other rater.
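The relationship between the regression slope and the correlation coefficient can be checked numerically. The sketch below fits an ordinary least-squares line and confirms that the slope equals r times the ratio of the standard deviations. Treating CC as the predictor and RD as the response is an assumption made here for illustration, and the data are hypothetical.

```python
import math

def centered_sums(x, y):
    """Return the means and the centered sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return mx, my, sxx, syy, sxy

def fit_line(x, y):
    """Ordinary least-squares fit y = b0 + b1 * x; returns (b0, b1)."""
    mx, my, sxx, _, sxy = centered_sums(x, y)
    b1 = sxy / sxx
    return my - b1 * mx, b1

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    _, _, sxx, syy, sxy = centered_sums(x, y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired measurements in cm (illustrative only).
cc = [10.2, 11.5, 12.1, 13.4, 14.0, 15.3]
rd = [10.0, 11.9, 12.0, 13.1, 14.4, 15.0]

b0, b1 = fit_line(cc, rd)
_, _, sxx, syy, _ = centered_sums(cc, rd)
slope_from_r = pearson_r(cc, rd) * math.sqrt(syy / sxx)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}, r * sy/sx = {slope_from_r:.3f}")

# A constant offset between raters appears in the intercept, not the slope.
rd_shifted = [v + 0.5 for v in cc]
b0_shifted, b1_shifted = fit_line(cc, rd_shifted)
```

In the shifted example the slope stays at one while the intercept absorbs the 0.5 cm offset, which is why the intercept, rather than the slope, carries the information about a constant bias.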
Although the Bland-Altman method has been discussed widely in the literature (Bland and Altman, 1984; Bland and Altman, 1995; Bland and Altman, 2003; Krummenauer and Doll, 2000; Braždžionytė and Macas, 2007; Abate et al, 1994; Powell et al, 2000; Edvardsen et al, 2002), we briefly explain it here for completeness. Let $d_i$ be the measurement difference, i.e. $d_i = CC_i - RD_i$. The measurement difference is the estimated bias of measurements between the two raters. Let $\bar{d}$ and $S_d^2$ be the mean and the variance of the differences. Bland and Altman plotted $d_i$ versus the average of the measurements of the two raters, with the reference lines $\bar{d}$, $\bar{d} - 1.96 S_d$ and $\bar{d} + 1.96 S_d$ (Bland and Altman, 1984). The range between $\bar{d} - 1.96 S_d$ and $\bar{d} + 1.96 S_d$ provides the “limits of agreement” (Figure 3).
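The quantities behind a Bland-Altman plot can be computed in a few lines. The following is a minimal sketch, assuming the sample standard deviation (n − 1 denominator) is used for the spread of the differences. The function name and the data are hypothetical.

```python
import statistics

def bland_altman(cc, rd):
    """Return the plot coordinates, the mean difference, and the limits of agreement."""
    diffs = [a - b for a, b in zip(cc, rd)]        # d_i = CC_i - RD_i
    means = [(a + b) / 2 for a, b in zip(cc, rd)]  # horizontal axis of the plot
    d_bar = statistics.mean(diffs)
    s_d = statistics.stdev(diffs)                  # sample SD of the differences
    limits = (d_bar - 1.96 * s_d, d_bar + 1.96 * s_d)
    return means, diffs, d_bar, limits

# Hypothetical paired measurements in cm (illustrative only).
cc = [10.2, 11.5, 12.1, 13.4, 14.0, 15.3]
rd = [10.0, 11.9, 12.0, 13.1, 14.4, 15.0]

means, diffs, d_bar, (lower, upper) = bland_altman(cc, rd)
print(f"mean difference = {d_bar:.3f}, limits of agreement = ({lower:.3f}, {upper:.3f})")
```

Plotting `diffs` against `means` with horizontal lines at `d_bar`, `lower`, and `upper` gives the usual display. A single extreme difference inflates the standard deviation and therefore widens both limits.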
The weakness of the Bland-Altman method is that measurement consistency is judged mainly visually, with no statistical significance attached to the plot. To attach statistical significance to the Bland-Altman procedure, a paired t-test can be used. We test whether the mean measurement difference is statistically distinguishable from zero using the test statistic
$$ T = \frac{\bar{d}}{S_d/\sqrt{n}}, $$
where $\bar{d}$ and $S_d$ are the sample mean and standard deviation of the differences $d_i = CC_i - RD_i$. Under the null hypothesis of no bias, $T$ is distributed as the t-distribution with n − 1 degrees of freedom.
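A minimal sketch of the paired t-statistic follows, using the mean and sample standard deviation of the differences between raters. The function name and the data are hypothetical. The example deliberately uses a small constant offset with little scatter to show how a small bias can still yield a large statistic.

```python
import math
import statistics

def paired_t(cc, rd):
    """Return (T, degrees of freedom) for the paired differences d_i = CC_i - RD_i."""
    diffs = [a - b for a, b in zip(cc, rd)]
    n = len(diffs)
    d_bar = statistics.mean(diffs)
    s_d = statistics.stdev(diffs)  # sample SD, n - 1 denominator
    return d_bar / (s_d / math.sqrt(n)), n - 1

# Hypothetical data: RD reads about 0.1 cm lower than CC with very little scatter.
cc = [10.20, 11.50, 12.10, 13.40, 14.00, 15.30]
rd = [10.11, 11.39, 12.00, 13.31, 13.89, 15.20]

t_stat, df = paired_t(cc, rd)
print(f"T = {t_stat:.2f} on {df} degrees of freedom")
```

Because the denominator shrinks as the differences become more uniform, a bias that is small in absolute terms can still produce a large T when the scatter of the differences is even smaller.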
All the previous methods determine consistency between a set of paired measurements. When there are more than two raters, these methods cannot be applied directly without significant modification. We propose to use an ANOVA approach for more general cases. The strength of ANOVA is that it can be used to determine both between- and within-rater measurement consistency. If we have information about how consistently each rater measures the same MRI, we can determine who is more consistent. This additional information can be used to train less consistent raters further.
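Let $X_{ijk}$ be the k-th measurement on the j-th MRI by the i-th rater. Then, the two-way ANOVA model is given as
$$ X_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk}, $$
where $\mu$ is the overall mean, $\alpha_i$ is the effect of the i-th rater, $\beta_j$ is the effect of the j-th MRI, $(\alpha\beta)_{ij}$ is the rater-by-MRI interaction, and $\epsilon_{ijk}$ is the random error.
The usual measurement consistency between CC and RD can be determined by testing $\alpha_{CC} = \alpha_{RD}$. The interaction term $(\alpha\beta)_{ij}$ is used to determine the within-rater consistency for the 10 MRIs. The within-rater consistency can be determined by simultaneously testing $(\alpha\beta)_{CC,1} = \cdots = (\alpha\beta)_{CC,10} = (\alpha\beta)_{RD,1} = \cdots = (\alpha\beta)_{RD,10}$.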
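For a balanced design, the F-statistics for the rater effect and the rater-by-MRI interaction can be computed directly from sums of squares. The sketch below assumes a fixed-effects two-way layout with an equal number of repeats in every cell, which is an assumption made here for illustration; the function name and the toy data are hypothetical.

```python
def two_way_anova(data):
    """F-statistics for a balanced two-way layout with replication.

    data[i][j] is the list of repeated measurements of MRI j by rater i.
    Assumes fixed effects and the same number of repeats in every cell.
    """
    a, b, r = len(data), len(data[0]), len(data[0][0])
    cell = [[sum(data[i][j]) / r for j in range(b)] for i in range(a)]
    rater = [sum(cell[i]) / b for i in range(a)]
    mri = [sum(cell[i][j] for i in range(a)) / a for j in range(b)]
    grand = sum(rater) / a

    ss_rater = b * r * sum((m - grand) ** 2 for m in rater)
    ss_inter = r * sum((cell[i][j] - rater[i] - mri[j] + grand) ** 2
                       for i in range(a) for j in range(b))
    ss_error = sum((x - cell[i][j]) ** 2
                   for i in range(a) for j in range(b) for x in data[i][j])

    df_rater, df_inter, df_error = a - 1, (a - 1) * (b - 1), a * b * (r - 1)
    ms_error = ss_error / df_error
    return {
        "F_rater": (ss_rater / df_rater) / ms_error,
        "df_rater": (df_rater, df_error),
        "F_interaction": (ss_inter / df_inter) / ms_error,
        "df_interaction": (df_inter, df_error),
    }

# Toy data: 2 raters x 2 MRIs x 2 repeats (illustrative only).
toy = [[[1, 3], [5, 7]],
       [[2, 4], [8, 10]]]
result = two_way_anova(toy)
print(result)
```

Under this layout, two raters, 10 MRIs, and three repeats would give degrees of freedom of (1, 40) for the rater effect and (9, 40) for the interaction. Converting the F-statistics to p-values requires the F-distribution, which is left to a statistics library.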
We can also visualize the within-rater consistency patterns using the box plot (Tukey, 1977). The box plot is a popular data visualization method and is drawn in the following way (Martinez and Martinez, 2005). First, we obtain the values corresponding to 25%, 50%, and 75% of the sorted observations. They are called the lower quartile q1, the median q2 and the upper quartile q3 respectively. The median q2 describes the center: half of the data is smaller than q2 and the other half is larger than q2. Then, we draw “the box” from q1 to q3 with a line at q2 within the box. This box gives the range containing the middle 50% of the data around q2. Finally, we draw one line from q1 to q1 − 1.5(q3 − q1) and another line from q3 to q3 + 1.5(q3 − q1), which are called “the whiskers.” In the box plot, observations below q1 − 1.5(q3 − q1) or above q3 + 1.5(q3 − q1) are flagged as potential outliers.
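The box plot quantities described above can be sketched as follows. Quartile conventions differ between software packages; this example assumes the "inclusive" interpolation method of Python's `statistics.quantiles`, and the function name and data are hypothetical.

```python
import statistics

def box_plot_summary(values):
    """Return the quartiles, the 1.5 * IQR fences, and the flagged observations."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "q2": q2, "q3": q3,
            "fences": (lower_fence, upper_fence), "outliers": outliers}

# Hypothetical values with one extreme observation (illustrative only).
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```

With these values the quartiles are 3.25, 5.5, and 7.75, the fences are −3.5 and 14.5, and the observation 30 is flagged as a potential outlier.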
Let $d_{j,k}$ be the difference between the k-th measurement of the j-th MRI and the average of the measurements of the j-th MRI by one fixed rater. The box plot of $d_{j,k}$ shows the spread of measurements for each MRI, and therefore how consistently each MRI is measured by a specific rater. We can visually compare within-rater consistency by comparing the box plots between CC and RD (Figure 4).
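The within-rater deviations can be computed by centering each MRI's repeated measurements on their own mean. A minimal sketch follows, with a hypothetical function name and hypothetical repeated measurements for a single rater.

```python
def within_rater_deviations(measurements):
    """measurements[j][k] is the k-th repeat on the j-th MRI by one fixed rater.

    Returns d[j][k], each measurement minus the mean of that MRI's repeats.
    """
    deviations = []
    for repeats in measurements:
        mri_mean = sum(repeats) / len(repeats)
        deviations.append([x - mri_mean for x in repeats])
    return deviations

# Hypothetical repeats in cm: three repeats on each of three MRIs.
rater_repeats = [[10.0, 10.2, 10.1],
                 [12.5, 12.4, 12.9],
                 [14.0, 14.0, 14.3]]

d = within_rater_deviations(rater_repeats)
for j, row in enumerate(d, start=1):
    print(f"MRI {j}: " + ", ".join(f"{v:+.3f}" for v in row))
```

Each row sums to zero by construction, so a box plot of a row shows only how tightly the rater reproduces the measurement on that MRI, independent of the structure's size.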
The fitted linear regression line for each head and neck structure appears as the dotted line in Figure 2. The measurements are more consistent when the dotted line is close to the solid line (y = x). The two lines were very close for HL, LFH, ATL, HVP and SP. In contrast, the dotted line was far from the solid line for VTL. The correlation coefficients of HL, LFH, ATL and HVP were 0.963, 0.987, 0.880 and 0.871 respectively (p-value < 0.001 in all cases). This implies that the measurements are consistent for HL, LFH, ATL and HVP, which coincides with what we observe in Figure 2.
On the other hand, the correlation coefficient was 0.875 (p-value < 0.001) for VTL, which seems to contradict Figure 2 because there was a clear systematic bias in VTL. We can infer from this that the correlation coefficient cannot detect this kind of measurement inconsistency. The correlation coefficient of SP was 0.089 (p-value = 0.639). Although CC and RD were otherwise consistent, a single outlier pulled the correlation coefficient close to 0. After removing the outlier, the correlation coefficient of SP becomes 0.673 (p-value < 0.001). This implies that the correlation coefficient is very sensitive to outliers.
In summary, the correlation analysis has difficulty detecting inconsistency between measurements. This is because the correlation coefficient shows the degree of association, not the degree of consistency. The correlation analysis is also very sensitive to outliers. As a result, the correlation analysis is not appropriate for assessing measurement consistency.
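Both weaknesses can be reproduced with synthetic numbers. The sketch below uses hypothetical data, not the study measurements: a constant 1 cm offset between raters leaves the correlation at one, while a single gross error drives it down sharply.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical reference measurements in cm.
cc = [float(v) for v in range(10, 20)]

# Case 1: the second rater is systematically 1 cm lower (pure bias).
rd_biased = [v - 1.0 for v in cc]
r_biased = pearson_r(cc, rd_biased)

# Case 2: small unbiased scatter, then one gross error in the first reading.
noise = [0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.2, -0.05, 0.1, -0.1]
rd_clean = [v + e for v, e in zip(cc, noise)]
rd_outlier = list(rd_clean)
rd_outlier[0] = 40.0
r_clean = pearson_r(cc, rd_clean)
r_outlier = pearson_r(cc, rd_outlier)

print(f"biased: r = {r_biased:.3f}")
print(f"clean: r = {r_clean:.3f}, with one outlier: r = {r_outlier:.3f}")
```

The first case mirrors the VTL pattern, where a systematic offset coexists with a high correlation, and the second mirrors the SP pattern, where one outlier masks otherwise consistent measurements.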
Figure 3 shows the Bland-Altman plots for each head and neck structure. Even though these plots show the degree of bias, it is not easy to draw conclusions about measurement consistency from them, because the Bland-Altman method has no statistical significance attached to the plot. Moreover, in measuring SP, one outlier severely widens the limits of agreement. In summary, the Bland-Altman method is not appropriate as a technique for determining measurement consistency.
The paired t-test indicates that there is significant inconsistency in measuring LFH (p-value = 0.008) and HVP (p-value = 0.038), although the scatter plots of LFH and HVP in Figure 2 show measurement consistency. This contradiction can happen if one rater systematically measures either larger or smaller than the other rater. When this systematic bias becomes larger than the measurement variance, the contradiction will occur.
In summary, the paired t-test can detect measurement bias between raters fairly well in most cases. However, it may fail when one rater systematically measures either larger or smaller than the other rater.
ANOVA results show that measurements are consistent between raters in measuring HL (p-value = 0.110), LFH (p-value = 0.517), ATL (p-value = 0.576), HVP (p-value = 0.937) and SP (p-value = 0.279) but not in measuring VTL (p-value = 0.029). This finding coincides exactly with what we found in Figure 2. The box plots in Figure 4 and the interaction term in ANOVA show which rater performs better. RD is significantly more consistent than CC in measuring HL (the first row in Figure 4; p-value < 0.001). CC is more consistent than RD in measuring LFH (the second row in Figure 4) but the difference was not significant (p-value = 0.770). RD is significantly more consistent than CC in measuring ATL (the third row in Figure 4; p-value = 0.008). CC is more consistent than RD in measuring HVP (the fourth row in Figure 4) but the difference was not significant (p-value = 0.152). RD is significantly more consistent than CC in measuring VTL (the fifth row in Figure 4; p-value = 0.016). CC is more consistent than RD in measuring SP (the sixth row in Figure 4) but this difference was not significant (p-value = 0.115).
In summary, ANOVA extends the paired t-test by also considering the within-rater consistency. ANOVA performs well in detecting measurement bias.
In this paper, we reviewed five techniques for determining measurement consistency of structures measured from head and neck MRI: the correlation analysis, the linear regression, the Bland-Altman method, the paired t-test and ANOVA. We showed the strengths and weaknesses of each technique in detecting measurement bias and in robustness to outliers. Table 1 summarizes the strengths and weaknesses of each technique.
A correlation analysis cannot detect the measurement inconsistency between raters and it is sensitive to outliers. So it is inappropriate to use the correlation analysis for determining measurement consistency. A linear regression should not be used either because it is equivalent to the correlation analysis.
It is not easy to make quantitative decision using the Bland-Altman method. This is mainly because the Bland-Altman plot does not have statistical significance attached to it. The paired t-test provides quantification for the Bland-Altman method and it shows a good performance in detecting measurement bias. However, when most of the measurements of one rater are consistently larger or smaller than the other rater, the paired t-test tends to fail.
ANOVA provides the best performance in all cases studied and showed accurate analysis results in determining the measurement consistency. In addition, it provides the additional information of within-rater consistency.
As suggested by Krummenauer and Doll (2000), a good rule to follow is not to limit the assessment of measurement consistency to only one method, but to apply and compare several methods. We also recommend making as many repeated measurements as time and cost permit for a more accurate determination of measurement consistency.
This work was supported in part by NIH Research Grants R03 DC4362 (Anatomic Development of the Vocal Tract: MRI Procedures) and R01 DC6282 (MRI and CT Studies of the Developing Vocal Tract) from the National Institute on Deafness and Other Communication Disorders (NIDCD). It was also supported by a core grant P-30 HD03352 to the Waisman Center from the National Institute of Child Health and Human Development (NICHHD). We thank Celia Choih for assistance with placing the anatomic landmarks and making the necessary measurements.