Imaging and estimation of left ventricular function have major diagnostic and prognostic importance in patients with coronary artery disease. It is vital that the method used to estimate cardiac ejection fraction (EF) allow the observer to best perform this task. To measure task-based performance, one must clearly define the task in question, the observer performing the task, and the patient population being imaged. In this report, the task is to accurately and precisely measure cardiac EF, and the observers are human-assisted computer algorithms that analyze the images and estimate cardiac EF. It is very difficult to measure the performance of an observer by using clinical data because estimation tasks typically lack a gold standard. A solution to this “no-gold-standard” problem, called regression without truth (RWT), was recently proposed.
Results of three different software packages used to analyze gated cardiac nuclear medicine images, each of which uses a different algorithm to estimate a patient’s cardiac EF, are compared. The three methods are the Emory method, the Quantitative Gated Single-Photon Emission Computed Tomographic method, and the Wackers-Liu Circumferential Quantification method. The same set of images is used as input to each of the three algorithms. Data from the three algorithms were analyzed by using RWT to determine which produces the best estimates of cardiac EF in terms of accuracy and precision.
In performing this study, three different consistency checks were developed to ensure that the RWT method is working properly. The Emory method of estimating EF slightly outperformed the other two methods. In addition, the RWT method passed all three consistency checks, garnering confidence in the method and its application to clinical data.
The American Heart Association estimates that more than 600,000 people in the United States died of causes related to coronary artery disease in 2000 (1). Imaging and estimation of left ventricular function and volumes have major diagnostic and prognostic importance in patients with coronary artery disease (2–4). Sharir et al (5) showed that the cardiac death rate increases exponentially as poststress cardiac ejection fraction (EF) decreases. Thus, because of this exponential relationship, small errors in estimates of cardiac EF may result in improper treatment. It is vital that the imaging system used to estimate cardiac EF allow the observer to best perform this task.
Cardiac EF is an example of an estimation task in medical imaging. Many imaging modalities can be used to estimate cardiac EF, including nuclear medicine, ultrasound, x-ray, and magnetic resonance imaging systems. In addition, many different computer algorithms and models exist for analyzing images produced by each modality. Intermodality comparisons (comparisons of different imaging modalities) and interalgorithm comparisons (comparisons of different algorithms) are necessary to determine how to best estimate cardiac EF. We use task-based measures of image quality to compare imaging modalities and algorithms. To measure task-based performance, one must clearly define the task in question, the observer performing the task, and the patient population being imaged. In this report, the task is to accurately and precisely measure cardiac EF, and the observers are human-assisted computer algorithms that analyze the images and estimate cardiac EF. Finally, the patient population consists of at-risk patients whose cardiac EF has been estimated. The overall goal is to determine which method allows for the most accurate and precise measurement of a patient’s cardiac EF.
It is very difficult to measure the performance of an observer by using clinical data. This difficulty arises because estimation tasks typically lack a gold standard (5–9). For example, a patient’s true cardiac EF can never be known. However, the patient’s cardiac EF can be estimated by using multiple imaging techniques (2,10–17). Researchers historically have been forced to compare estimates that result from one imaging modality with results from a different, more accepted imaging modality (18–21). This comparison is tantamount to correlating the estimates of the two modalities with each other; such an analysis reveals nothing about how either estimate relates to the gold standard.
A solution to this problem, called the “no-gold-standard” problem, was recently proposed by Kupinski et al (22) and Hoppin et al (23) and is called regression without truth (RWT). The proposed solution to the no-gold-standard problem was demonstrated and studied in simulation (22,23). We endeavored to make our simulation studies as realistic as possible; however, simulation studies always raise questions about performance on real data. We performed phantom studies of the method and found that RWT performs well on real image data taken from a phantom, using a volume-estimation task (24). In this report, we compare results of three different software packages used to analyze gated cardiac nuclear medicine images, each of which uses a different algorithm to estimate a patient’s cardiac EF. The same set of images is used as input to each of the three algorithms. We analyze the data from the three algorithms by using RWT to determine which produces the best estimates of cardiac EF.
We briefly describe the three different computer algorithms and the RWT method and then present and discuss our results. We show that RWT returns values consistent with the data, thus further validating the use of RWT. Questions still exist about the usefulness of RWT, which we address further in the Discussion section.
Left ventricular EF (LVEF) was estimated in 122 consecutive patients who underwent single-photon emission computed tomographic (SPECT) perfusion imaging immediately after either exercise or pharmacological stress (25). Perfusion imaging was performed with 148 MBq (4 mCi) of thallium 201 on a Philips Medical Systems (Amsterdam, The Netherlands) Prism 3000XP SPECT system equipped with low-energy high-resolution parallel-hole collimators on each of its three camera heads. Projections were acquired as 64 × 64 frames with a pixel size of 6.34 mm at 3° intervals over 120° of rotation for each head. Acquisition time was 26 seconds at each projection angle, and this time was divided into eight intervals during the cardiac cycle by using electrocardiographic gating with tracking of the average gate interval from frame to frame. Projections were filtered two-dimensionally before reconstruction with a Butterworth filter with an order of 5.0 and a cutoff value of 0.2 of the sampling frequency. They were reconstructed by filtered backprojection by using the standard software of the Philips Medical Systems Odyssey Workstation to which the SPECT system was attached.
LVEF was obtained in automatic mode with no operator adjustment by using three commercially available programs. The first was the Quantitative Gated SPECT (QGS) program from Cedars-Sinai (26,27). With QGS, the center of mass of the left ventricular counts is detected automatically, radial sampling is used to fit an ellipsoid to the midmyocardial surface, and a cost function is then used to determine the endocardial and epicardial surface points associated with the ellipsoid. The enclosed endocardial volumes at end-diastole and end-systole are used to calculate LVEF. The second program used to calculate LVEF was the Emory Tool Box (Emory) (13,28). The Emory program starts by automatically determining in short-axis slices the apex, basal valve plane, and long axis of the left ventricle. A cylindrical coordinate system is used to find the locations of the center of the heart wall in the basal and midventricular regions, and a spherical coordinate system is used to find the center of the wall in the apical region. A 10-mm myocardial thickness is assumed at end-diastole, with one half of this on either side of the center locations. Wall thickening is determined during the cardiac cycle, and epicardial and endocardial surfaces are determined. LVEF is again calculated from end-diastolic and end-systolic volumes. The third program used to calculate LVEF was the Wackers-Liu Circumferential Quantification (WLCQ) program from Yale (29,30). This program determines the location of maximum counts in circumferential count profiles for a series of short-axis slices containing the left ventricle. Then the location of the endocardial surface within this region is determined, the apical region is modeled as a semiellipsoid for which parameters are determined from eight central horizontal long-axis slices, and LVEF is calculated from the enclosed end-diastolic and end-systolic volumes.
The RWT method is described in detail in other publications (22,23). We present the method briefly for completeness. The RWT method assumes a parametric relationship between the gold standard EF Θp for patient p and the estimate of that patient’s cardiac EF, denoted θpm for the pth patient and mth method of estimation. Although this relationship may be arbitrarily complicated, to date, we have assumed a linear relationship between the gold standard and the estimates through the equation:

θpm = amΘp + bm + εpm, (1)
where εpm is a zero-mean noise term characterized by its SD σm, and the parameters am, bm, and σm constitute the linear-model parameters that describe how the mth method’s estimates relate to the true gold standard. The same set of patients had their cardiac EFs measured by means of all methods. In general, one would not expect a highly nonlinear relationship between the gold standard and the estimation method unless the estimation method produced purely senseless results. Assuming a linear relationship is tantamount to claiming that the method requires some simple calibration to match, on average, the gold standard. We previously showed that a nonlinear relationship can be used with RWT (22).
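As a concrete illustration, the data-generating model of equation 1 can be simulated. The sketch below is ours, not part of the original study: the parameter values, method names, and the beta-distributed gold standard are all hypothetical, chosen only to show the structure of the data that RWT operates on.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear-model parameters (a_m, b_m, sigma_m) for three methods.
params = {"method1": (1.05, -0.02, 0.04),
          "method2": (0.90, 0.06, 0.05),
          "method3": (1.10, -0.05, 0.06)}

# Gold-standard EFs drawn from a beta distribution (illustrative shape only);
# in practice the gold standard is never observed.
theta_true = rng.beta(6.0, 4.0, size=200)

# Equation 1: theta_pm = a_m * Theta_p + b_m + eps_pm, eps_pm ~ N(0, sigma_m^2).
estimates = {m: a * theta_true + b + rng.normal(0.0, s, theta_true.size)
             for m, (a, b, s) in params.items()}
```

RWT sees only the `estimates` arrays; `theta_true` is shown here solely to make the model explicit.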
We assume that the noise term εpm is distributed as a zero-mean Gaussian with an SD of σm that is a characteristic of the method of estimation. Other noise models can be accommodated easily by RWT. We must further assume that the gold standard Θp is sampled from a distribution whose shape is characterized by the vector of parameters Ω. We previously studied two different types of parameterized distribution: truncated-normal distribution and β distribution (23). The truncated-normal distribution is a Gaussian distribution with a mean μΘ and a variance σΘ (ie, Ω = [μΘ, σΘ]) that is truncated below zero and above one. This truncation allows the distribution to represent cardiac EFs, which are bounded between zero and one. The β distribution (Appendix) also is bounded between zero and one but, unlike the truncated-normal distribution, can have zero probability density at zero and one for certain values of Ω.
The truncated-normal distribution has the unfortunate property that there is always a nonzero probability density at cardiac EFs of zero and one (and hence a nonzero probability of obtaining an EF in a differential interval around either endpoint). Both situations should have zero probability density: a patient will never have a cardiac EF close to one or close to zero. Based on this information and the work of Sharir et al (5) showing distributions of estimated cardiac EFs, we postulate that the β distribution is the more appropriate distribution for our study.
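This endpoint argument can be verified numerically. In the sketch below (our illustration, with hypothetical truncated-normal parameters), the truncated normal keeps strictly positive density at both endpoints, whereas a beta density with both shape parameters greater than one vanishes there.

```python
from scipy import stats

# Truncated normal on [0, 1] with hypothetical mean 0.5 and SD 0.25.
mu, sd = 0.5, 0.25
tn = stats.truncnorm((0.0 - mu) / sd, (1.0 - mu) / sd, loc=mu, scale=sd)

# Beta distribution with both shape parameters > 1 (illustrative values).
bt = stats.beta(6.0, 4.0)

# Density at the endpoints EF = 0 and EF = 1.
tn_ends = (tn.pdf(0.0), tn.pdf(1.0))   # strictly positive at both endpoints
bt_ends = (bt.pdf(0.0), bt.pdf(1.0))   # zero at both endpoints
```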
Using the linear model and one of the two distributions characterizing the gold standard, we can form a likelihood expression (22) to be used for maximum-likelihood estimation. Maximum-likelihood estimation returns estimates of am, bm, and σm for all methods m, as well as an estimate of Ω that characterizes how the gold standard is distributed. The parameters am and bm can be used to “adjust” the estimates θpm to be closer, on average, to the gold standard. Furthermore, the term σm/am characterizes the reproducibility (refer to equation 1) of the method and provides a figure of merit for comparing the different methods. One can argue that the method that produces the most reproducible estimates, after being adjusted to be unbiased, is the best method.
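The likelihood can be sketched in code. The following is our illustration, not the authors' implementation: it marginalizes the unobserved gold standard over a fixed grid (a simple numerical quadrature; the published method may integrate differently) under a beta-distributed gold standard.

```python
import numpy as np
from scipy import stats

def neg_log_likelihood(x, data):
    """RWT negative log-likelihood for M methods and a beta-distributed gold
    standard.  x packs [a_1..a_M, b_1..b_M, log(sigma_1)..log(sigma_M),
    log(alpha), log(beta)]; data is a (P, M) array of EF estimates."""
    P, M = data.shape
    a, b = x[:M], x[M:2 * M]
    sigma = np.exp(x[2 * M:3 * M])
    alpha, beta = np.exp(x[3 * M]), np.exp(x[3 * M + 1])
    # Marginalize the unobserved gold standard Theta_p over a grid on (0, 1).
    grid = np.linspace(1e-3, 1.0 - 1e-3, 200)
    dg = grid[1] - grid[0]
    prior = stats.beta.pdf(grid, alpha, beta)          # pr(Theta | Omega)
    ll = 0.0
    for p in range(P):
        # Product over methods of the Gaussian terms of equation 1,
        # integrated against the prior on Theta.
        lik = np.prod(stats.norm.pdf(data[p][:, None],
                                     loc=a[:, None] * grid[None, :] + b[:, None],
                                     scale=sigma[:, None]), axis=0)
        ll += np.log(np.sum(lik * prior) * dg + 1e-300)
    return -ll
```

Minimizing this function (eg, with `scipy.optimize.minimize`) yields the estimates of am, bm, σm, and Ω, after which the figure of merit σm/am follows directly.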
In our previous works (22–24), we were able to compare our estimates of linear model parameters with true linear model parameters because we had the gold standard available to us, either through simulation or from a phantom study. For this study, we are forced to address our lack of a gold standard, as would any researcher using RWT on real data. Therefore, we cannot compare our estimates with the truth because the truth is not available to us. Instead, we are limited to performing consistency checks on the estimates returned by RWT. These consistency checks will not guarantee that the method is working properly. However, if the method does not pass these consistency checks, we know the RWT method is not working properly. Therefore, although we cannot state with certainty that RWT is working properly, we can run numerous studies to ascertain whether the method is not working properly. We run three different consistency checks, labeled consistency check 1 (CC1), CC2, and CC3, described in detail next.
The RWT method returns information relating the gold standard to the estimates returned by a method (equation 1). However, two different estimation methods share the same gold standard because the same patient population is used for both. Thus, we can use the results returned by RWT to predict the relationship between two sets of estimates. Mathematically, this relationship is determined by solving equation 1 for the gold standard for each of the estimates θpi and θpj and neglecting the noise terms, resulting in:

θpi = (ai/aj)θpj + bi − (ai/aj)bj. (2)
If the parameters returned by the RWT method accurately relate the gold standard to estimates returned by the various methods, the slopes and intercepts determined by equation 2 should accurately relate the two estimates to each other. The converse of the previous statement is not necessarily true. However, the contrapositive of this statement is true and allows us to state that if the relationship between the estimates is not determined accurately, the relationship between the gold standard and the estimates also is not accurately determined by means of RWT.
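CC1 can be sketched as follows (our illustration, with hypothetical parameter values): given fitted slopes and intercepts, the line implied by equation 2 is compared with an ordinary least-squares fit of one method's estimates against the other's.

```python
import numpy as np

def implied_line(a_i, b_i, a_j, b_j):
    """Line relating method i to method j implied by equation 2:
    theta_i = (a_i / a_j) * theta_j + b_i - (a_i / a_j) * b_j."""
    slope = a_i / a_j
    intercept = b_i - slope * b_j
    return slope, intercept

def cc1(theta_i, theta_j, a_i, b_i, a_j, b_j):
    """Compare the RWT-implied line with a direct least-squares fit of
    method i's estimates against method j's."""
    implied = implied_line(a_i, b_i, a_j, b_j)
    fitted = tuple(np.polyfit(theta_j, theta_i, 1))
    return implied, fitted
```

Note that the direct regression slope is slightly attenuated by the noise in θpj, so exact agreement is not expected; with small σm the two lines should nearly coincide.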
Another consistency check we can perform involves comparing the distribution of the gold standard returned by means of RWT, pr(Θ|Ω), to the histogram of the raw data θpm for a given method m calibrated to match the gold standard. Specifically, we compare the gold-standard density to a histogram of:

(θpm − bm)/am (3)
for all patients p and for a given method m. These data should not match exactly because equation 3 has extra noise added onto it from the εpm term. However, if the variance of εpm is small for method m, the histograms should match the gold-standard density returned by means of RWT. As we show next, all variance terms for the three methods we are evaluating are small.
A final consistency check we developed compares the sample covariance of the EF estimates returned by means of the various methods with what is predicted by the model used (equation 1). This test incorporates two very important aspects of the RWT model: the first is the linear relationship we assume, and the second is the independence of the noise in estimates returned by means of different modalities. Using these two aspects of our RWT model, it can be shown that:

Cov(θi, θj) = aiajVar(Θ), (4)
where θi and θj are the estimates of EF from the different methods, âi and âj are the estimates of the slopes returned by means of RWT, Var(Θ) is the variance of the gold-standard parameter, and Cov(θi, θj) is the covariance of the estimates returned by means of method i and the estimates returned by means of method j. The expectations needed to compute the covariance and variance in equation 4 are taken over the population of patients; hence the single subscript denoting the method on the estimates θi and the lack of a subscript on the gold standard Θ. The left side of equation 4 can be estimated by using the sample covariance of the given data. However, the right side of equation 4 is determined completely by the slopes returned by means of RWT and by Ω, which determines the variance of the gold standard. Hence, we are relating a data measure to a value determined completely by means of RWT and the assumptions made by RWT. Again, this check can only tell us if the method is not working or our assumptions are invalid; it cannot inform us if the method is working properly.
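CC3 can be sketched as below (our illustration). For a beta-distributed gold standard with parameters α and β, Var(Θ) = αβ/[(α + β)²(α + β + 1)], which supplies the right side of equation 4.

```python
import numpy as np

def cc3(theta_i, theta_j, a_i, a_j, alpha, beta):
    """CC3 (equation 4): the sample covariance of two methods' estimates
    should match a_i * a_j * Var(Theta), because the noise terms of the
    two methods are assumed independent.  Var(Theta) is the variance of
    the fitted beta distribution."""
    var_theta = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1.0))
    sample_cov = np.cov(theta_i, theta_j)[0, 1]
    predicted = a_i * a_j * var_theta
    return sample_cov, predicted
```

A large discrepancy between the two returned values would indicate that the linear model or the noise-independence assumption is violated.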
Recall that the RWT method assumes a linear relationship between the gold standard and the estimates returned by each method. The RWT method also uses a parametric distribution to characterize the randomness in patients’ cardiac EFs. We study two different types of parameterized distribution: truncated-normal distribution and β distribution (23). We performed our RWT analysis on the data set of 85 patients twice: once using the truncated-normal distribution and once using the β distribution. Table 1 lists the results returned by means of RWT for both studies.
CC1 checks that the linear model parameters returned by means of RWT are consistent with comparisons made between the estimates from two different methods. This analysis is particularly interesting because researchers often perform regression analysis when developing a new technique. Here, we did not perform regression analysis; we determined the relationship among the three estimation methods and the gold standard and used that information to determine the relationship among the various estimation methods. Figure 1 shows these results for β distribution. Results for truncated-normal distribution are nearly identical (ie, the RWT-fit lines are indistinguishable from those shown in Figure 1). Thus, both methods return estimates of linear model parameters that are consistent with CC1, although the linear model parameters determined by means of the two methods are different. Table 2 lists these results and the slopes and intercepts returned by using conventional regression analysis. One would not expect results obtained by using RWT to match exactly results obtained by using conventional regression analysis; however, we found that results were close for both the β and truncated-normal distributions.
CC2 checks the distribution of the gold standard determined by both the model used (β or truncated-normal) and the parameters returned by means of RWT that characterize the shape of these distributions. CC1 validated the estimated linear model parameters; this consistency check attempts to validate the estimated parameters characterizing the gold-standard distribution. Figures 2 and 3 show the gold-standard densities using the estimated parameters and the histograms of the raw data adjusted by using equation 3, for the truncated-normal distribution and β distribution, respectively. It is clear from Figure 2 that the method returns a density that has substantial nonzero probability density at cardiac EFs of zero and one. Thus, there is a finite probability of obtaining an ejection fraction close to zero or one; both cases clearly are not possible. Although the densities and histograms appear to be consistent, the resultant distributions do not agree with our prior knowledge of the range of cardiac EF values. Using the β distribution (Figure 3), the method fares much better: ranges on the density using the β distribution are consistent with what measured cardiac EFs should be. Thus, the truncated-normal distribution, although it seems to pass CC2, does not agree with our prior knowledge of the range of EF values.
The final consistency check (CC3) uses both the estimated linear model parameters and estimated parameters characterizing the gold standard to compare theoretically determined covariances (equation 4) with covariances measured from the data. Table 3 lists results of CC3. It is clear that the estimates returned from the β distribution very closely match the measured covariances determined by the data. The differences are in the third significant figure. Truncated-normal values for the covariances are close, but not as close as values returned when using β distribution. Again, when we used β distribution, the method clearly passes CC3. Results are less clear for truncated-normal distribution.
The RWT method is meant to be used on data for which a gold standard is not available. However, to date, we have performed simulation studies and phantom studies for which the gold standard was available to us. Because the gold standard is unavailable in practice, one can never be completely assured that RWT is working properly. However, in this report, we developed three consistency checks that can be performed on RWT results to garner confidence in them. These three consistency checks analyze the linear model parameters returned by means of RWT, as well as the parameters characterizing the gold-standard distribution.
Using cardiac EF data derived from gated SPECT studies, we applied RWT and performed our three consistency checks. Using β distribution for the gold standard, we found that results of RWT passed all three checks. Furthermore, we determined that all three methods (Emory, QGS, and WLCQ) have similar performances, although the Emory method appears to slightly outperform the other two methods.
The reader may have noticed a large discrepancy between the slopes and intercepts returned by means of RWT using β and truncated-normal as the assumed distributions. RWT using the β distribution performed better in the consistency checks than RWT using the truncated-normal distribution. However, it is difficult to determine which model is more appropriate for these data by using the results of the consistency checks alone. This question of model comparison may be addressed by using the maximum value of the likelihood function for each assumed distribution: the model with the higher likelihood is the more appropriate model. This procedure is analogous to using maximum-likelihood theory to choose the model. The log-likelihood value for RWT using the β distribution was −229, whereas the log-likelihood for RWT using the truncated-normal distribution was −279. This implies that the β distribution is more appropriate for these data, which we suspected from the consistency check results. Although this analysis may indicate that the linear model parameters returned by means of RWT are more accurate with the β distribution, it should be noted that the rank order of the systems remained the same using either model. We do not suggest that one use this model-comparison technique for a large number of possible models with different numbers of parameters; this type of analysis could return misleading results.
The three consistency checks presented cannot conclusively validate the results of RWT; they can, however, conclusively invalidate these results if the method has not performed well or the model used is inappropriate. RWT is a method that directly addresses the no-gold-standard problem, although without a gold standard, it is difficult to validate the results of RWT. The consistency checks we developed enable researchers to garner confidence in the results of RWT.
Supported under National Institutes of Health/National Institute of Biomedical Imaging and Bioengineering grant no. R01-EB002146.
The probability density function for the β distribution, with parameter vector Ω = [α, β], has the form:

pr(Θ|α, β) = [Γ(α + β)/(Γ(α)Γ(β))] Θ^(α−1) (1 − Θ)^(β−1), 0 ≤ Θ ≤ 1,

where Γ denotes the gamma function. For α > 1 and β > 1, this density is zero at Θ = 0 and Θ = 1.