The study protocol was in compliance with the Declaration of Helsinki, and institutional review board approval was prospectively obtained at the University of California, San Francisco, and Chiang Mai University.
This study used fundus photographs acquired from 94 patients with HIV at the Ocular Infectious Diseases Clinic at Maharaj Nakorn Chiang Mai Hospital in Chiang Mai, Thailand, from August 2008 through April 2009. All patients presenting for their initial visit or diagnosed with CMV retinitis in the previous month were offered enrollment. The image capture protocol was analogous to the fundus photography protocol used in the Longitudinal Study of the Ocular Complications of AIDS.12
A trained ophthalmic photographer used a digital fundus camera (TRC-NW 6S; Topcon, Oakland, NJ) to capture overlapping 45° fundus photographs in each eye: one central image of the optic disc and macula, with eight or more surrounding midperipheral images. Each eye was photographed only once. A mean of nine fundus photographs was taken of each eye (range, 8–15).
In this ancillary study, we included 89 eyes with a definite clinical diagnosis of CMV retinitis, 5 eyes with an uncertain diagnosis of CMV retinitis, and 5 eyes with a diagnosis made by photographic review only. One of the authors (JC) loaded all photographs for each study eye into each of three commercially available automontage programs: OIS AutoMontage (OIS, Sacramento, CA), i2k Retina (DualAlign LLC, Clifton Park, NY), and IMAGEnet Professional (Topcon, Oakland, NJ). We used the factory default settings for each program, without specialized image registration settings. Although we did not use any of the advanced options, it should be noted that all three programs can accommodate as many individual photographs as necessary; all offer blended or unblended mosaics, i2k Retina and OIS AutoMontage provide advanced blending options, and IMAGEnet allows manual construction of a mosaic. The same author (JC) created all the mosaics, documenting whether at least two individual photographs were stitched together and whether all the individual photographs in the set had been incorporated into the mosaic. The completed mosaics were presented to three expert graders (DH, TPM, and JDK) for two rounds of grading. First, the mosaics were presented as randomly sorted sets of three mosaics per study eye (one mosaic from each software program), to facilitate side-by-side comparison. Second, the mosaics were presented in random order (without stratifying by study eye), with each mosaic paired with the individual central photograph used in constructing it. The graders were masked to the identity of the software.
In the first round of grading (i.e., side-by-side comparison of the three fundus montages from the same study eye), the three graders assigned each image a rank based on personal preference (1, best; 3, worst). Images that could not be stitched into a mosaic by one program, but could be by another, were automatically given the lowest ranking. In the second round of grading, the same three graders assessed whether each mosaic was gradable or ungradable. Images that were considered to be ungradable were not assessed further. For gradable mosaics, images were assessed for the presence or absence of the following five photographic features: duplicated vessels (duplication of the same blood vessel from a common origin), breakage in vessel continuity (sudden sharp disruption of a single, nonduplicated blood vessel that is not due to image misplacement), misplaced images (placement of one or more of the individual photographs in an incorrect sector of the montage), relative blurriness of the mosaic image (compared to the individual central photograph), and acceptable versus unacceptable overall quality of the mosaic (acceptable mosaics were defined as the presence of few or no mosaic errors and the preference for using the mosaic as opposed to the individual images for clinical care). The graders were also invited to make comments about individual mosaics. They repeated this exercise on a set of 50 randomly selected duplicate mosaics approximately 2 weeks after the initial grading, to assess intrarater reliability. All graders viewed images for the study on a glossy monitor manufactured by Apple, Inc. (Cupertino, CA).
We computed descriptive statistics for each software program, separately for each grader. The proportion of mosaic images assigned a rank of 1 and the proportion assigned a rank of 3 were computed for each program. To assess whether the ranks differed among the three software programs (using all three raters for each eye), we computed a simple rank centroid statistic, as described in Appendix A. We then examined whether differences in rank were statistically significant in pairwise comparisons of the three programs.
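The formal rank centroid statistic is defined in Appendix A; the per-program summaries described above can be sketched as follows (the data layout, a mapping from program to the pooled ranks across eyes and graders, is an assumption for illustration):

```python
def rank_proportions(ranks):
    """Proportion of mosaics ranked best (1) and worst (3) for one program."""
    n = len(ranks)
    return sum(r == 1 for r in ranks) / n, sum(r == 3 for r in ranks) / n

def rank_centroid(ranks_by_program):
    """Mean rank per program; a lower centroid indicates more frequent preference."""
    return {p: sum(r) / len(r) for p, r in ranks_by_program.items()}

# Hypothetical pooled ranks (one entry per eye-grader rating).
ranks = {
    "OIS AutoMontage": [1, 2, 1, 3, 1],
    "i2k Retina":      [2, 1, 2, 1, 2],
    "IMAGEnet":        [3, 3, 3, 2, 3],
}
```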
Intrarater reliability was assessed using Cohen's κ on the two separate measurements from each rater (with 95% CI computed using the bootstrap percentile method),13 except when the point estimate was 1, in which case we used an exact method to compute a one-sided lower 97.5% confidence bound.14
Interrater reliability between all three raters was assessed with the Fleiss κ for multiple raters (with 95% CI computed using the bootstrap percentile method).13
The κ coefficients were interpreted according to the following categories: κ < 0.40 corresponds to poor agreement, κ from 0.40 to 0.75 to fair to good agreement, and κ > 0.75 to excellent agreement.15
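A minimal pure-Python sketch of the intrarater computation, assuming paired category labels from a rater's two grading sessions; the resampling count and the guard for a degenerate resample are illustrative, and the exact-method lower bound used when κ = 1 is not reproduced here:

```python
import random
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two paired sets of category labels."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)  # chance agreement
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)       # guard degenerate resamples

def bootstrap_kappa_ci(a, b, reps=2000, alpha=0.05, seed=1):
    """Bootstrap percentile CI: resample item indices, recompute kappa."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohens_kappa([a[i] for i in idx], [b[i] for i in idx]))
    stats.sort()
    return stats[int(reps * alpha / 2)], stats[int(reps * (1 - alpha / 2)) - 1]
```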
We used clustered logistic regression to model the probability of the presence of each feature, accounting for the correlation between images captured from the same person and for the correlation between images graded by the same grader. We used a likelihood ratio test to assess whether the frequency of each photographic feature differed significantly among the three programs. When this overall test was statistically significant, we further examined pairwise contrasts.
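The regression models themselves were fit in R; as an illustration of the likelihood-ratio step alone, assuming the maximized log-likelihoods of the null model (no program term) and the full model are available (the values below are hypothetical, and the 2 degrees of freedom reflect the three-level program factor):

```python
import math

def lrt_pvalue_2df(ll_null, ll_full):
    """Likelihood ratio test statistic and p-value for dropping a
    3-level factor (2 df). For df = 2, the chi-square survival
    function has the closed form exp(-x / 2)."""
    stat = 2.0 * (ll_full - ll_null)
    return stat, math.exp(-stat / 2.0)

# Hypothetical log-likelihoods from the null and full clustered models.
stat, p = lrt_pvalue_2df(ll_null=-412.7, ll_full=-401.3)
```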
We then assessed whether the presence or absence of each photographic feature was associated with the programs' rankings. Because each grader determined both the ranks and the presence or absence of each feature of interest, we calculated a correlation coefficient for each eye–grader combination. Each estimated correlation coefficient summarizes the degree of association between the presence or absence of a feature and the rankings of the three software programs (for that eye and grader). In the absence of any association, the expected average of these correlations is 0. We tested the hypothesis that the mean of these correlations was 0 by estimating the intercept in a linear mixed-effects regression model with eye and grader as random effects.
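A sketch of the per-combination correlation, assuming each eye–grader combination contributes a 0/1 feature vector and the rank vector across the three programs; returning 0 when the feature is constant is our convention here, not necessarily the paper's, and the mixed-effects intercept test itself was run in R:

```python
def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant
    (e.g., a feature absent from, or present in, all three mosaics)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

# Hypothetical eye-grader combination: the feature is present only in
# the program ranked worst, giving a positive association with rank.
r = pearson([0, 0, 1], [1, 2, 3])
```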
To determine whether missing individual images (i.e., not all individual images incorporated into the mosaic) were predictive of the ranks assigned by each grader, we computed the rank centroid for images with the missing indicator and the rank centroid for those without, and computed the distance between these two centroids (see Appendix A for details). Under the null hypothesis, this distance should be 0 on average; its sampling distribution was computed using the permutation method. We then computed Spearman's rank correlation coefficient between missing individual images and the mean rank of each photograph. Statistical analysis was performed with R 2.12 (University of Auckland, Auckland, New Zealand). We accounted for multiple comparisons by setting the significance level at α = 0.002, reflecting a Bonferroni correction for 25 comparisons at a family-wise error rate of α = 0.05.
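The permutation step can be sketched as follows, with the centroid distance reduced to a one-dimensional mean-rank difference for illustration; the full statistic is defined in Appendix A, and the variable names and data layout are assumptions:

```python
import random

def centroid_distance(ranks, missing):
    """Absolute difference between the mean rank of mosaics with
    missing individual images and the mean rank of the rest."""
    g1 = [r for r, m in zip(ranks, missing) if m]
    g0 = [r for r, m in zip(ranks, missing) if not m]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

def permutation_test(ranks, missing, reps=5000, seed=1):
    """One-sided p-value: fraction of shuffled label assignments whose
    centroid distance is at least the observed distance."""
    rng = random.Random(seed)
    obs = centroid_distance(ranks, missing)
    perm = list(missing)
    count = 0
    for _ in range(reps):
        rng.shuffle(perm)  # shuffling preserves the number of "missing" labels
        count += centroid_distance(ranks, perm) >= obs
    return obs, count / reps
```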