Automated identification of diabetic retinopathy (DR), the primary cause of blindness and visual loss for those aged 18–65 years, from color images of the retina has enormous potential to increase the quality, cost–effectiveness and accessibility of preventative care for people with diabetes. Through advanced image analysis techniques, retinal images are analyzed for abnormalities that define and correlate with the severity of DR. Translating automated DR detection into clinical practice will require surmounting scientific and nonscientific barriers. Scientific concerns, such as DR detection limits compared with human experts, can be studied and measured. Ethical, legal and political issues can be addressed, but are difficult or impossible to measure. The primary objective of this review is to survey the methods, potential benefits and limitations of automated detection in order to better manage translation into clinical practice, based on extensive experience with the systems we have developed.
There were 23.6 million people with diabetes in the USA in 2007, and diabetes caused approximately 24,000 cases of blindness that year. Prevalence of diabetes has rapidly risen to 7.8% of the population on average, and to 10.4% of Hispanic Americans and 11.8% of blacks over the age of 20 years. Blindness and visual loss can be prevented through early detection and timely management. There is widespread consensus that regular early detection of diabetic retinopathy (DR), or screening, is required and is cost effective in people with diabetes [2–5]. Almost 50% of people with diabetes in the USA currently do not undergo any form of regular documented dilated eye examination, despite guidelines published by the American Diabetes Association, the American Academy of Ophthalmology and the American Optometric Association. In the UK, a smaller proportion, approximately 20–30%, of those with diabetes are not regularly evaluated, despite an aggressive effort to increase screening for people with diabetes. Digital imaging with expert reading has been shown to be comparable or superior to an office visit for assessing DR [7,8], and has been proposed as an approach to make the dilated eye examination available to unserved and underserved populations that do not receive regular examinations from eye-care providers. If all of these underserved populations were to be provided with digital imaging, the annual number of retinal images requiring evaluation in the USA would be 32 million (~40% of people with diabetes and at least two photographs per eye) [8,9].
The current challenge is to make early detection more accessible by reducing the cost and manpower required, while maintaining or improving DR detection quality. This challenge can be met by utilizing computer-assisted or automated detection of DR in retinal images.
When translating automated DR detection into clinical practice, both scientific and nonscientific issues arise. The scientific issues can be studied and measured: the diagnostic performance compared with human experts and the amount of time the computer needs to obtain a diagnostic result. Ethical and political issues also need to be addressed, but these are difficult or impossible to measure. For example, is it politically and emotionally acceptable to diagnose a patient based upon their retinal images when those images are never seen by a physician? Societal and physician acceptance of automated DR screening will encourage us to consider retinal DR screening as we might a laboratory test, with acceptably high detection rates and low false-positive rates. To fully understand this kind of issue, an understanding of the potential benefits and limitations of automated detection is required, and providing it is the primary objective of this review.
Terminology matters greatly to those with diabetes. The term ‘people with diabetes’ is the accepted term for those who have been diagnosed with diabetes, and is thus appropriate also for those who are being examined for DR. For clarity, the term ‘patient’ is used interchangeably with people with diabetes in this review. The expression ‘automated detection’ within this review implies that the images of at least some patients are never examined by a human expert, while ‘computer-aided diagnosis’ is reserved for instances when all images of all patients are examined or reviewed by an expert.
Most early detection programs or screening programs use non-mydriatic digital color fundus cameras to acquire color photographs of the back of the eye, the retina. These photographs are then examined by human experts for the presence of specific lesions that are indicative of DR. The lesions primarily include microaneurysms, hemorrhages, exudates and cotton-wool spots, as well as blood vessel caliber changes. Other, non-DR lesions can also be detected, including nevi and melanomas. If such abnormalities are found, typically in less than 10% of patients who are imaged [9–11], the patient is then referred to an eye-care provider (usually an ophthalmologist or retinal specialist) for further diagnosis and, possibly, treatment. This 10% figure is typical of established early detection programs that have a steady influx of people with newly discovered diabetes and in which the initial backlog of patients with DR has already been referred for specialist management.
In theory, this system works well, but the weak link is the human experts reading the photographs. They are expensive to train and difficult to retain given the type of work (evaluating hundreds if not thousands of photographs every day, most of which have no abnormalities), and intra- and inter-observer variability necessitates frequent and costly retraining and recertification. In addition, the delay between the patient being imaged and the result of the reading can be days, thereby making it impossible to inform the patient about the test result while they are still available at the point of service.
Thus, using a computer to analyze the retinal images for abnormalities, and then either assisting the human expert reader – a process called computer-aided diagnosis or detection – or allowing the system to autonomously decide whether the images contain abnormalities – a process called automated reading – is a logical development.
Such systems can be placed either close to or in the camera, that is, at the point of service, or be accessible online, where images are submitted to a central server for reading. The former allows the patient to be informed of the test result immediately; the latter allows greater oversight of the quality of the detection process.
Approximately 90% of all patients in DR screening programs do not have DR, and only 10% or less will be discovered to have DR, the natural incidence of DR in people with diabetes. Given this distribution, a fully automated system will have to decide which of the patients are not suspect for the disease, without at least some of them also being seen by a human expert. A computer-aided system, by contrast, will indicate to the human expert which images or image regions contain abnormalities, and thus requires the human expert to evaluate each image for each patient. For the anticipated use of an automated system, a high sensitivity (not missing patients who are suspect for the disease) is a safety issue, and is more important than a high specificity, which is an efficiency issue. Great demands are therefore placed on such a system, and extensive testing is required to ensure its safety. We will now provide an overview of the historical development and state of the art of these algorithms.
There is some debate regarding what level of DR needs prompt medical examination, and requirements for different referral criteria have been advocated through publication. Many early-detection programs in the USA and Europe, including in the UK and at the United States Veterans Affairs Health System, are so-called American Telemedicine Association category 2 systems, which require referral at any level of diabetic macular edema, and/or at least mild nonproliferative DR (i.e., Early Treatment Diabetic Retinopathy Study [ETDRS] level ≥35), in either eye. Other authors propose referral for any form of DR (e.g., a single or few microaneurysms [MAs]), that is, an ETDRS level of 20 or more. However, MAs can occur in the absence of diabetes: in 9040 nondiabetic subjects in the Atherosclerosis Risk in Communities Study, more than 6.5% were found to have minimal signs of retinopathy. While subjects with at least three MAs did usually have diabetes, those with fewer than three did not [Hubbard L, Pers. Comm.]. The Diabetes Control and Complications Trial (DCCT) performed a secondary analysis, defining onset of DR as presence of three MAs rather than simply one MA, in order to achieve a more robust definition of DR onset (i.e., less likely to show later regression).
In 1984, Baudoin et al. were the first to describe an image analysis method for detecting MAs, one of the most characteristic lesions in DR. They applied their analysis method to angiographic images rather than to color images, which are widely available, noninvasive and relatively inexpensive. Because of concerns about safety and cost, reliance on angiographic images would rule out cost–effective large-scale screening. The work of Baudoin et al. was based on converting individual angiogram frames into digital images of the retina and then detecting MAs using a ‘top-hat’ transform, a step-type of digital image filter. The field changed dramatically in the 1990s with the development of digital retinal imaging and the expansion of digital filter-based image analysis techniques. These developments resulted in a rapidly increasing number of publications, which is continuing to expand. Since then, many approaches have used the following principle: an image transform of some kind is used for detecting candidate lesions, after which a mathematical morphology template is used to characterize candidates. Additional enhancements were made, which include the contributions of Spencer, Cree, Frame and coworkers [19,20]. They added preprocessing steps, such as shade-correction, and matched filter postprocessing to this basic framework to improve performance. Algorithms of this kind function by detecting candidate MAs of various shapes, based on their response to specific image filters. A supervised classifier is typically developed to separate the valid MAs from spurious or false responses. However, these algorithms were originally still designed to detect the high-contrast signatures of MAs present on fluorescein angiogram images.
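To make the top-hat idea concrete, the following is a minimal one-dimensional sketch, not the method of Baudoin et al. or of later authors. It assumes a flat structuring element and a made-up intensity profile in which a narrow bright blob stands in for a candidate lesion; the function names and the response cutoff of 5 are illustrative choices. The morphological opening (erosion followed by dilation) removes structures narrower than the window, and subtracting the opening from the original signal leaves only those narrow structures.

```python
def erode(signal, width):
    """Grey-scale erosion: minimum over a sliding window of odd `width`."""
    half, n = width // 2, len(signal)
    return [min(signal[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def dilate(signal, width):
    """Grey-scale dilation: maximum over a sliding window of odd `width`."""
    half, n = width // 2, len(signal)
    return [max(signal[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def white_top_hat(signal, width):
    """Signal minus its opening; keeps bright details narrower than `width`."""
    opened = dilate(erode(signal, width), width)
    return [s - o for s, o in zip(signal, opened)]


# Hypothetical profile: slowly rising background with a two-pixel bright blob.
profile = [10, 11, 12, 13, 14, 30, 31, 17, 18, 19, 20]
response = white_top_hat(profile, width=5)

# Illustrative cutoff: positions with a strong response become candidates.
candidates = [i for i, r in enumerate(response) if r > 5]
```

In this sketch the slowly varying background is suppressed and only the two blob positions exceed the cutoff. In a real system, such candidates would still have to be characterized and classified, as described above.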
The next important development resulted from applying a modified version of the top-hat algorithm to red-free fundus photographs rather than angiogram images, as was first described by Hipwell et al. They tested their algorithm on a large set of over 3500 images and found a sensitivity/specificity point of 0.85/0.76. Once this step had been taken, development accelerated. The approach was further refined by broadening the single filter top-hat transform, originally developed by Baudoin to detect candidate pixels, to a multifilter filter-bank approach [22–24]. The filter responses are used to identify pixel candidates using a classification scheme, and then mathematical morphology and another classification step are applied to candidates to decide whether they are valid MAs and hemorrhages. A very similar approach was also successful in detecting other types of DR lesion, including exudates or cotton-wool spots, as well as drusen, a lesion associated with age-related macular degeneration.
Additional refinements to this framework of candidate detection and classification of candidates that are continuing to be improved include the explicit avoidance of false-positives near, or on, tortuous vessels. Another example is a wavelet-based method, whereby a multiband wavelet transform (similar to a filter-bank transform) was used to detect candidates. Within the wavelet transform space, candidates are compared with a similarly transformed MA template to classify whether they are valid lesions. Thus, instead of learning a model of the lesions through training, this approach uses an expert-designed model of the lesion as the detector. Advances are continuing to be made, especially in the design of optimal filters for detecting MAs.
A different approach, known as region-growing segmentation, was applied by Sinthanayothin et al. [29,30]. Region-growing is an approach to segmentation of image regions in which pixels are examined and added to a region class if no edges are detected. Rather than using a candidate-detection transform followed by classification, Sinthanayothin et al. applied a region-growing algorithm to segment retinal blood vessels and red lesions simultaneously. Vessel regions were then classified using a neural network-based classifier, and the remaining pixels were characterized as MAs.
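The region-growing principle can be illustrated with a short sketch. It is a generic textbook version and not the algorithm of Sinthanayothin et al.: the "no edge detected" test is approximated here by an assumed intensity tolerance between neighbouring pixels, and the image, the seed and the name `grow_region` are hypothetical.

```python
from collections import deque


def grow_region(image, seed, tolerance):
    """Collect 4-connected pixels reachable from `seed` where each step between
    neighbouring pixels changes intensity by at most `tolerance`."""
    rows, cols = len(image), len(image[0])
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region:
                # A large intensity jump is treated as an edge and not crossed.
                if abs(image[nr][nc] - image[r][c]) <= tolerance:
                    region.add((nr, nc))
                    queue.append((nr, nc))
    return region


# Hypothetical 4 x 4 patch: a dark 2 x 2 blob on a bright background.
image = [
    [90, 90, 90, 90],
    [90, 40, 42, 90],
    [90, 41, 43, 90],
    [90, 90, 90, 90],
]
dark_region = grow_region(image, seed=(1, 1), tolerance=5)
```

Starting from a seed inside the dark blob, growth stops at the sharp transition to the background, so the region contains only the four blob pixels.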
As mentioned previously, high demands must be placed on a DR detection system that makes ‘decisions’ that have potentially vision-threatening consequences. Thus, some confidence in the agreement between the system and expert human readers is expected. The agreement between automatic system and expert reader may be impaired owing to the algorithm, or to the protocol or camera used to generate the fundus images. For example, an imaging protocol that does not allow small lesions to be seen or detected, which might be detected with an improved camera or better imaging protocol (see later), will lead to an overestimation of the performance of the system. The system appears improved if human experts and the algorithm both overlook lesions, as it increases agreement between them.
The performance of a DR detection system can be measured as its sensitivity, a number between 0 and 1, which is the number of true-positives divided by the sum of the total number of false-negatives (incorrectly missed) and true-positives, and its specificity, also between 0 and 1, which is the number of true-negatives divided by the sum of the total number of false-positives (incorrectly thought to have disease) and true-negatives. Sensitivity and specificity both require a ‘reference standard’, which is a discrete value (0 or 1) for each patient expressing the estimate of what is ‘truly’ the matter with that patient. A reference standard can come, for example, from evaluation by a reading center.
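The two definitions above translate directly into code. The sketch below assumes that both the system output and the reference standard are given as lists of 0/1 values, one per patient; the function name and the example data are made up for illustration.

```python
def sensitivity_specificity(predicted, reference):
    """Return (sensitivity, specificity) of 0/1 predictions against a 0/1 reference."""
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)  # true-positives
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)  # incorrectly missed
    tn = sum(1 for p, r in pairs if p == 0 and r == 0)  # true-negatives
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)  # incorrectly flagged
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical example: 3 patients with disease, 5 without.
reference = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
sensitivity, specificity = sensitivity_specificity(predicted, reference)
```

Here two of the three diseased patients are detected (sensitivity 2/3) and four of the five healthy patients are correctly passed (specificity 0.8).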
The output of an algorithm, its estimate of what is the matter with a patient, can be a discrete number (0 or 1). However, the output is often a continuous likelihood p-value between 0 and 1, and an algorithm can be made more specific or more sensitive by setting a threshold on this p-value.
The area under the curve (AUC) of the receiver operating characteristic (ROC) curve for an algorithm is determined by setting, for example, 100 different thresholds on the p-value, and obtaining the sensitivity and specificity of the algorithm at each of these thresholds. The reference standard is kept constant. The resulting sensitivity/specificity pairs are plotted in a graph, resulting in a ROC curve. This curve can be reduced to a single number, the area under that curve, or AUC. The maximum AUC is 1, denoting a perfect diagnostic procedure, with sensitivity and specificity both being 1 (100%).
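The threshold sweep described here can be sketched as follows. This is an illustrative implementation under stated assumptions: scores lie between 0 and 1, the reference standard contains both diseased and non-diseased patients, and the area is computed with the trapezoidal rule. The function names and the example data are hypothetical.

```python
def roc_points(scores, reference, n_thresholds=100):
    """Return sorted (1 - specificity, sensitivity) pairs over evenly spaced thresholds."""
    positives = sum(reference)
    negatives = len(reference) - positives
    points = set()
    for k in range(n_thresholds + 1):
        threshold = k / n_thresholds
        tp = sum(1 for s, r in zip(scores, reference) if s >= threshold and r == 1)
        fp = sum(1 for s, r in zip(scores, reference) if s >= threshold and r == 0)
        points.add((fp / negatives, tp / positives))
    points.add((0.0, 0.0))  # threshold above every score: nothing is flagged
    return sorted(points)


def area_under_curve(points):
    """Trapezoidal area under a curve given as sorted (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical scores: three diseased patients followed by four non-diseased.
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.2, 0.1]
reference = [1, 1, 1, 0, 0, 0, 0]
auc = area_under_curve(roc_points(scores, reference))
```

In this small example, 11 of the 12 possible diseased/non-diseased pairs are ranked correctly by the score, and the computed area is correspondingly 11/12.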
Although the AUC would seem to cover the most important aspect of performance, we have argued that other measures are also important.
The DR detection algorithms described above are so-called machine learning or supervised algorithms – meaning that they have been trained to detect a specific type of abnormality, most commonly MAs. This approach works well for common DR lesions. In fact, if other, less common lesions, such as neovascularization (NV; a sign of advanced retinopathy), occur, they usually appear together with the more common lesions such as MAs. Less common lesions occurring in isolation are extremely rare, occurring in less than 1:1000 patients. However, such rare lesions will not occur in typical training sets with hundreds or thousands of images. NV and choroidal melanoma are excellent examples: their prevalence is low, less than 1:200,000 in the case of choroidal melanomas, and the prevalence of (macular) melanomas, visible in otherwise unenriched DR images, is lower still. However, missing either of these lesions may mean a missed diagnosis of a potentially blinding or fatal disease. If the only metric used to measure an algorithm is the AUC or sensitivity and specificity, the risk of applying algorithms to large populations without addressing missed diagnosis of rare high-morbidity or -mortality lesions becomes significant.
Now that the feasibility of detecting DR lesions in retinal images has been demonstrated and different approaches have been developed, the next advance is to determine how these different approaches rank in terms of performance. One limitation is that none of the methods described in this review were compared on the same dataset and the measures of performance reported were often different, making direct comparisons between various methods difficult to impossible. Recently, two large studies of automated screening systems have appeared in the literature [12,31–34]. The systems evaluated were designed to perform DR screening in an automated fashion. Each system stores those examinations that, after analysis, appear to have no visible signs of the presence of diabetic retinopathy at some level of severity. Only those examinations possibly containing diabetic retinopathy-related lesions, or that have lesion-detection-limiting image quality, are evaluated by a human expert. In this manner, these automated systems can reduce the workload associated with large-scale screening. However, they have not been compared on the same dataset using the same metric.
For years, many researchers and groups have been working on improving DR algorithms, often in collaboration. The first author of this review has been leading several group efforts, and we are familiar with the work of many others. Of necessity, we had to limit our discussion of the state of the art of such systems to specific examples, but many others have made significant contributions or achieved similar results.
The Iowa system is a collection of algorithms, some of which were partially developed in other centers, including Utrecht, The Netherlands and Brest, France. In their current form, they were tested as an assembled system at the University of Iowa (IA, USA) [22,24,26,35,36]. Central to the Iowa system is the availability of retinal images from tens of thousands of patients, from a number of diabetic retinopathy screening programs around the world. Owing to this unique resource, we can evaluate new or existing algorithms on large datasets, on representative data – that is, images obtained by different camera operators, from different camera types, from patients with different ethnic and racial backgrounds. All were obtained from populations of people with diabetes who were thought not to have diabetic retinopathy, and where the incidence of diabetic retinopathy was less than 15%.
Some algorithms only detect MAs and forgo a phase of detecting normal retinal structures – the optic disc, fovea and retinal vessels – which can act as confounders for abnormal lesions. However, as described previously, most systems use a series of steps whereby initially these normal structures are identified, and then abnormal areas are identified and characterized. We have refined a set of algorithms that are capable of detecting specific retinal structures and abnormalities (Figure 1). Our studies on small numbers of patients showed that such algorithms can perform comparably to retinal specialists. The images were all derived from actual screening datasets – that is, datasets in which less than 15% of patients have DR [9–11]. The performance of these component algorithms was:
Initially, the retinal part of the entire image is detected, and the nonretina parts are discarded in a preprocessing step. After this step, the image is analyzed at the pixel level and information about local pixel features is extracted. Following this, the image quality is determined and stored. The normal anatomy is detected – that is, the vessels, the optic disc and foveal location. The system then detects possible image pixels that could be part of an abnormality. The detected pixels are grouped into candidate lesions and each of these is extensively analyzed. Based on the outcome of the analysis, the system determines for each candidate lesion the probability that it is a true red or bright lesion. The probabilities of each lesion for each image are then combined into a patient-level probability of that patient having DR or low-quality photographs, in a classification step that we call fusion, which uses image-level features such as the number of red lesions per image. An example for a single examination is given in Figure 2.
The examinations in the datasets on which we have tested the system have been analyzed only once by a single retinal specialist, which is less than optimal. However, asking retinal specialists or a reading center to re-evaluate 40,000 or 60,000 images in masked fashion is almost prohibitive. To address this problem, we have obtained a random sample of 500 unique patients from our previous study, had them reread in masked fashion by three retinal specialists and determined the sensitivity and specificity of each physician compared with the original reading. The results show that the performance of the algorithms measured in ROC only slightly lags behind the sensitivity and specificity of the experts. In fact, the system can always match the sensitivity of the human expert – but at lower specificity (see Figure 3 [dotted line]). We have not yet performed a cost–effectiveness study, but others have shown that this level of performance is cost effective [42,43].
To determine how well DR detection systems detect DR compared with retinal specialists, the performance of such a system (‘DC 2008’) on 10,000 examinations from 5692 unique patients with diabetes (two images per eye, centered on the fovea and disc, respectively) was investigated, comparing the system to a single reader who had previously classified all examinations as ‘DR suspect’, ‘DR not suspect’ or ‘insufficient quality’. The examinations were collected from patients with diabetes from a true DR screening program, at ten different clinics with four different types of retinal nonmydriatic camera. These patients had not previously been diagnosed with DR and had not received a dilated retinal examination less than 1 year before inclusion. The system achieved an AUC of 0.85 on the first visit, at an optimal sensitivity of 0.84 and specificity of 0.64. That the system is relatively stable is shown by the fact that the AUC for the second visit was 0.84. At this point, 7645 out of 10,000 examinations (76%) had acceptable image quality, 4648 out of 7645 (61%) were true-negatives, 59 out of 7645 (0.7%) false-negatives, 357 out of 7645 (5%) true-positives and 2581 out of 7645 (34%) false-positives. In total, 27% of false-negatives contained large hemorrhages and/or NV. The number of false-negatives was related to the number of lesions per image: the fewer the lesions, the higher the probability of a false-negative.
Exhaustive testing of an algorithm leads to the risk of ‘overfitting’ – that is, the system has optimal performance on the test dataset or similar datasets, but fails if tested on datasets that are different in some aspect, for example, the incidence of DR in the population, the race or the ethnicity distribution of the population. Being conscious of this risk means ensuring that the datasets are as diverse as possible in terms of prevalence of DR, image quality, image resolution, race and ethnic background. As an example, although publicly available datasets are rare, we evaluated this on the only third-party accessible dataset, Méthodes d’Evaluation de Systèmes de Segmentation et d’Indexation Dédiées à l’Ophtalmologie Rétinienne (MESSIDOR), which is a clinical, not a screening, dataset. The ARVO 2008 version of the algorithm on 400 images gave an AUC of 0.89 in separating the ‘no or minimal retinopathy’ (classes 0 and 1) from the ‘suspect retinopathy’ (classes 2, 3 and 4).
We hypothesized that there was no difference in the performance of the algorithm on images from men and women. In the 10,000 examinations in the Diabetes Care paper sample, we found the following performance: for 4962 men, of which 493 were expert-graded as having DR, the AUC was 0.829, while for 5036 examinations from women, of which 490 were suspect, the AUC was 0.821, a nonsignificant difference.
We hypothesized that there would be no effect of camera resolution on the system performance, and determined the sensitivity/specificity pairs at the optimal threshold, as outlined above, for the four types of cameras used: 640 × 480 = 0.3 Megapixels (Mp), and 45° field of view: 0.86/0.22; 768 × 576 = 0.4 Mp and 35°: 0.88/0.56; 1792 × 1184 = 2.1 Mp: 0.78/0.83; 2048 × 1536 = 3.1 Mp and 35°: 0.78/0.73. We concluded that performance is slightly poorer for cameras with less than 1 Mp, but not for cameras above 1 Mp.
As explained, many systems consist of multiple detection algorithms. The output of these detection systems needs to be combined in some fashion – a process we call fusion. We hypothesized that information or probability fusion of the output of the different algorithm components, as well as the output for different images and eyes for the same examination, would affect performance. Information fusion is necessary as the algorithms in the screening system process each image in the examination separately. For each image the result is an image quality measurement, and a set of bright and red lesion objects with a soft label. The soft label indicates the probability a detected object is a real lesion. The information fusion method that is used to merge all this data into an examination-level outcome had a major impact on system performance. We tested this hypothesis on another unique set of 15,000 examinations (60,000 images), called ‘TMI2008’, and found that the choice of fusion system [12,46] has a substantial influence on the overall system performance, with simple fusion methods obtaining an AUC of 0.82, while a supervised fusion system obtained an AUC of 0.881.
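To illustrate what a simple fusion rule might look like, the sketch below combines the soft labels of all detected lesions in an examination into one examination-level score. The two rules shown (maximum and a noisy-OR that assumes lesions are independent) are generic examples chosen for illustration; the text does not specify which simple methods were evaluated, and the supervised fusion system is not reproduced here. The data layout and function names are hypothetical.

```python
def fuse_max(examination):
    """Examination score = highest lesion probability in any image (0 if none)."""
    return max((p for image in examination for p in image), default=0.0)


def fuse_noisy_or(examination):
    """Probability that at least one lesion is real, assuming independence."""
    none_real = 1.0
    for image in examination:
        for p in image:
            none_real *= 1.0 - p
    return 1.0 - none_real


# Hypothetical examination: four images, each a list of lesion soft labels.
examination = [[0.2, 0.5], [], [0.1], []]
score_max = fuse_max(examination)
score_or = fuse_noisy_or(examination)
```

The two rules already disagree on this small example (0.5 versus 0.64), which gives some intuition for why the choice of fusion method can shift examination-level performance. A supervised alternative would instead learn the combination from labeled examinations, using features such as the number of lesions per image.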
We hypothesized that addition of the location of DR lesions with respect to the fovea, especially for exudates, will improve performance of the algorithm compared with a human expert, because an exudate is more alarming when it is closer to the fovea. We determined AUC including and excluding foveal distance for red and bright lesion features on 15,000 examinations, leaving all other features and parameters the same as ‘DC 2008’. We found the AUC was 0.900 when excluding, and 0.902 including distance – a nonsignificant difference.
As described earlier, a system that is used on real patients should not only be evaluated on its sensitivity and specificity for DR detection, as these measurements do not accurately reflect the full system safety. To maximize screening safety, the system must be capable of detecting rare, and typically larger, high-risk or high-morbidity nondiabetic lesions. Therefore, we have analyzed the type of false-negatives (incorrectly missed exams and lesions) missed by our system when applied to the TMI2008 dataset. The system’s threshold was set to obtain a sensitivity of 0.96 and 0.50 specificity on these examinations: at this setting it produced 25 false-negatives. A different, unmasked, retinal expert was asked to assess these 25 false-negatives as a second reader, and agreed with the first reader in 22 out of 25 cases. Most false-negatives, 13 out of 25, were examinations that contain a low number (i.e., up to two) of relatively large hemorrhages connected to, or just adjacent to, the vasculature and no other abnormalities. The next major group, eight out of 25, contained examinations with a small number (i.e., up to four) of isolated exudates close to the fovea without other abnormalities. The remaining two out of 25 were very subtle – that is, they contained a single MA or a laser scar. The two out of 25 where the second reader disagreed were choroidal nevi, not associated with DR.
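Setting a threshold to reach a chosen sensitivity, as was done for this analysis, can be sketched as follows. The code is an illustration only: it assumes examination-level scores where higher means more suspect, a 0/1 reference standard, and made-up data; the function names are hypothetical.

```python
def threshold_for_sensitivity(scores, reference, target):
    """Highest threshold t such that flagging score >= t reaches `target` sensitivity."""
    positives = sorted((s for s, r in zip(scores, reference) if r == 1), reverse=True)
    for k, threshold in enumerate(positives, start=1):
        if k / len(positives) >= target:
            return threshold
    raise ValueError("no threshold reaches the target sensitivity")


def specificity_at(scores, reference, threshold):
    """Fraction of reference-negative cases scoring below the threshold."""
    negatives = [s for s, r in zip(scores, reference) if r == 0]
    return sum(1 for s in negatives if s < threshold) / len(negatives)


# Hypothetical scores: four diseased cases followed by four non-diseased.
scores = [0.9, 0.7, 0.6, 0.3, 0.8, 0.5, 0.2, 0.1]
reference = [1, 1, 1, 1, 0, 0, 0, 0]

t_75 = threshold_for_sensitivity(scores, reference, target=0.75)
t_100 = threshold_for_sensitivity(scores, reference, target=1.0)
spec_75 = specificity_at(scores, reference, t_75)
spec_100 = specificity_at(scores, reference, t_100)
```

In this toy example, raising the required sensitivity from 0.75 to 1.0 lowers the threshold from 0.6 to 0.3 and the specificity from 0.75 to 0.5, the same kind of trade-off that underlies the operating point reported above.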
We hypothesized that content-based image retrieval can detect a single type of lesion [27,49–51]. We tested this hypothesis for retinal NV, as marked by a single reader, on the ‘TMI2008’ dataset. A probability of presence of NV was assigned for each examination, based on the distribution of the wavelet transform coefficients of each image, and an AUC of 0.79 was found for NV detection. No lesion-specific segmentation was used, meaning that we expect not to need to develop other gross abnormality segmentations. We developed a retinal atlas by registering all images of the same eye for a patient, locating the disc, fovea and vascular arcades automatically with the aforementioned algorithms, and then warping the retinal image into the atlas to a standard fovea-, disc-, arcade-based retinal coordinate system. We built an atlas for 300 patients with diabetes without DR, and thus know the expected properties and probabilities for every pixel in a retinal image. Gross lesions are identified as large, contiguous areas of pixels that are two standard deviations away from the expected value in the atlas.
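The last step, flagging pixels that lie more than two standard deviations from the atlas expectation, can be sketched as follows. This is a minimal illustration under strong assumptions: the images are taken to be already registered to a common coordinate system and are represented as small nested lists of grey values, and the names `build_atlas` and `deviating_pixels` are hypothetical. Registration, warping and the grouping of flagged pixels into large contiguous areas are not shown.

```python
from statistics import mean, pstdev


def build_atlas(normal_images):
    """Per-pixel (mean, standard deviation) over co-registered normal images."""
    rows, cols = len(normal_images[0]), len(normal_images[0][0])
    atlas = []
    for r in range(rows):
        atlas_row = []
        for c in range(cols):
            values = [img[r][c] for img in normal_images]
            atlas_row.append((mean(values), pstdev(values)))
        atlas.append(atlas_row)
    return atlas


def deviating_pixels(image, atlas, n_sd=2.0):
    """Pixels whose value differs from the atlas mean by more than n_sd deviations."""
    flagged = []
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            mu, sd = atlas[r][c]
            # Note: where sd is 0, any nonzero deviation is flagged.
            if abs(value - mu) > n_sd * sd:
                flagged.append((r, c))
    return flagged


# Hypothetical 2 x 2 "normal" images and one test image.
normals = [
    [[100, 50], [80, 58]],
    [[104, 52], [78, 60]],
    [[96, 48], [82, 62]],
]
atlas = build_atlas(normals)
test_image = [[101, 51], [60, 61]]
flagged = deviating_pixels(test_image, atlas)
```

Only the lower-left pixel of the test image, which is far darker than in any of the normal images, is flagged. A full system would additionally require flagged pixels to form a large contiguous area before reporting a gross lesion.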
To drive the development of ever better image analysis methods, research groups have established publicly available, annotated image databases in various fields. Examples are the Structured Analysis of the Retina (STARE) [52,53], Digital Retinal Images for Vessel Extraction (DRIVE) [23,54] and MESSIDOR databases for retinal images. Even though evaluations on these types of jointly used datasets are already a large improvement over evaluation on separate datasets, different groups tend to use different metrics to compare system performance, and even when using the same evaluation measures, implementation details of these metrics may have an influence on the final result.
The next step was providing publicly available annotated datasets for online, standardized evaluation, leading to an ‘asynchronous competition’. The Middlebury Stereo Vision website was the first of these, and others have since become available.
In an asynchronous competition, all results are evaluated using the same evaluation software, but groups are allowed to submit results continuously over time. Some groups may be tempted to game the system – for example, by using human readers to assist the performance of their algorithm, or by iteratively improving performance through submitting multiple results serially and using the obtained performance to tune the algorithm. Synchronous competitions have developed where there is a deadline for submitting results, and the competition results are announced at a single moment in time. These kinds of joint evaluations on a common dataset have the potential to steer future research by showing the failure modes of certain techniques and to guide the practical application of techniques in clinical practice, especially if appropriate reward mechanisms are available, such as for the highly successful Netflix competition.
The first Retina Online Challenge (ROC) MA detection competition was organized in 2009. We made available online a set of images with the expert readings of MAs for training any supervised algorithm, and another set with images where we withheld the expert readings. In total, 23 groups participated, six groups submitted their results for comparison and we recently published the final results.
We found no significant improvement in performance on the ‘TMI2008’ dataset using the best-performing algorithm from the first Retina Online Challenge. Dr Gwénolé Quellec was the author of the Retina Online Challenge’s best performing algorithm while in Brest, France, and he recently joined our team in Iowa. We hypothesized that his algorithm might show a better AUC than our system on the same dataset, because it is based on a somewhat different approach, namely a wavelet transform, as discussed previously. We found the AUC for this algorithm was 0.82 and the AUC for the original Iowa algorithm was 0.84 (not significantly different), if they both were limited to red lesion detection for comparison purposes. The AUC for a combination of both algorithms was 0.86, which was a significant improvement. The Netflix prize experience has also unequivocally shown that combining disparate algorithms can lead to a performance improvement. In this study, we also looked at the maximum achievable performance given specific characteristics of the dataset, and found that these algorithms are approaching the maximum achievable AUC.
Scotland and colleagues have performed an analysis of the cost–effectiveness of a system for automated detection of DR. In a study of 6722 patients, they modeled fully manual grading, as well as automated reading used as a filter before human grading. Their model was based on the refer–no refer approach that is also used in other screening programs [9,10]. They estimated the cost per patient for automated detection to be UK£0.14 (~US$0.25), appropriate for the government-subsidized healthcare system in Scotland. Based on established criteria for risk transitions and costs of different states, they concluded that automated and manual methods result in approximately the same number of correct screening outcomes, but at lower cost for automated methods. Their sensitivity analysis showed that cost–effectiveness was primarily sensitive to the sensitivity/specificity of the primary human reader, and much less so to the prevalence or incidence of DR or the subtypes of DR. They did not perform a quality-adjusted life years analysis.
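The arithmetic behind using automated reading as a filter can be illustrated with a deliberately simplified sketch. This is not the model of Scotland and colleagues, which also accounts for risk transitions and the costs of different states. Only the UK£0.14 automated cost comes from the text above. The manual grading cost, prevalence, sensitivity and specificity below are hypothetical placeholders, and all function names are ours.

```python
def fraction_flagged(prevalence, sensitivity, specificity):
    """Fraction of patients the automated filter passes on to a human grader.

    True positives plus false positives, as a share of everyone screened.
    """
    return prevalence * sensitivity + (1.0 - prevalence) * (1.0 - specificity)


def filter_cost_per_patient(cost_auto, cost_manual, flagged):
    """Every patient incurs the automated cost; flagged patients also incur manual grading."""
    return cost_auto + flagged * cost_manual


def filter_is_cheaper(cost_auto, cost_manual, flagged):
    """Compare the filter pathway with fully manual grading of every patient."""
    return filter_cost_per_patient(cost_auto, cost_manual, flagged) < cost_manual


if __name__ == "__main__":
    COST_AUTO = 0.14    # UK pounds per patient, from the text
    COST_MANUAL = 1.00  # hypothetical placeholder, NOT from the study
    flagged = fraction_flagged(prevalence=0.10, sensitivity=0.90, specificity=0.80)
    cost = filter_cost_per_patient(COST_AUTO, COST_MANUAL, flagged)
    print(f"flagged for human grading: {flagged:.2f}")
    print(f"filter cost per patient:   {cost:.2f}")
    print("cheaper than fully manual: ", filter_is_cheaper(COST_AUTO, COST_MANUAL, flagged))
```

In this simplified view, the filter saves money whenever the automated cost is smaller than the manual cost multiplied by the fraction of patients the filter clears without human review.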
Diabetic retinopathy detection algorithms achieve performance comparable to that of a single retinal expert reader, and are close to mature: further measurable improvements in detection performance have become difficult to achieve. For translation into clinical practice sooner rather than later, validation on well-defined populations of patients with diabetes, with variable metabolic control and racial and ethnic diversity, is more urgent than further algorithm development. We anticipate that automated systems based on algorithms, such as those discussed here, will allow cost-effective early detection of DR in millions of people with diabetes, and allow triage of those patients who need further care at a time when they have early rather than advanced DR.
It is clear that automated and computer-assisted detection of DR have matured rapidly over the last few years. The basic approaches are relatively stable, and improvements are primarily occurring at the leading edge, including detection of confounder and additional lesions, and detection of rare diseases. In fact, some groups have moved toward commercialization and clinical deployment.
Over the next 5 years, we expect research in automated detection of diabetic retinopathy (DR) to focus on the following key issues:
Supported by National Institutes of Health R01 EY017066; Research to Prevent Blindness (RPB), NY, USA; University of Iowa; and Netherlands Organization for Health Related Research (NWO). One patent (USPTO 7474775) and six patent applications are in place for some of the algorithms described.
Financial & competing interests disclosure
The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties.
No writing assistance was utilized in the production of this manuscript.
Michael D Abramoff, Departments of Ophthalmology and Visual Sciences and Electrical and Computer Engineering, University of Iowa, 11290C PFP UIHC, 200 Hawkins Drive, Iowa City, IA 52242, USA, Tel.: +1 319 384 5833, Email: michael-abramoff@uiowa.edu.
Meindert Niemeijer, Department of Ophthalmology and Visual Sciences, University of Iowa Hospitals and Clinics, 200 Hawkins Drive, Iowa City, IA 52242, USA and Image Sciences Institute, University of Utrecht, The Netherlands.
Stephen R Russell, Department of Ophthalmology and Visual Sciences, University of Iowa Hospitals and Clinics, 200 Hawkins Drive, Iowa City, IA 52242, USA.