| Title |
Comparison of validity and reliability of manual consensus grading vs. automated AI grading for diabetic retinopathy screening in Oslo, Norway: a cross-sectional pilot study |
| Authors |
Karabeg, Mia; Petrovski, Goran; Holen, Katrine; Steffensen Sauesund, Ellen; Fosmark, Dag Sigurd; Russell, Greg; Erke, Maja Gran; Volke, Vallo; Raudonis, Vidas; Verkauskienė, Rasa; Sokolovska, Jelizaveta; Moe, Morten Carstens; Kjellevold Haugen, Inga-Britt; Petrovski, Beata Eva
| DOI |
10.3390/jcm14134810 |
| Full Text |
|
| Is Part of |
Journal of Clinical Medicine. Basel: MDPI, 2025, vol. 14, iss. 13, art. no. 4810, p. 1-14. ISSN 2077-0383
| Keywords [eng] |
EyeArt; artificial intelligence (AI); automated grading; diabetic macular edema; diabetic retinopathy; diagnostic accuracy; fundus photography; manual consensus grading; screening program
| Abstract [eng] |
Background: Diabetic retinopathy (DR) is a leading cause of visual impairment worldwide. Manual grading of fundus images is the gold standard in DR screening, although it is time-consuming. Artificial intelligence (AI)-based algorithms offer a faster alternative, though concerns remain about their diagnostic reliability. Methods: A cross-sectional pilot study among patients (≥18 years) with diabetes was established for DR and diabetic macular edema (DME) screening at Oslo University Hospital (OUH), Department of Ophthalmology, and the Norwegian Association of the Blind and Partially Sighted (NABP). The aim of the study was to evaluate the validity (accuracy, sensitivity, specificity) and reliability (inter-rater agreement) of automated AI-based grading compared with manual consensus (MC) grading of DR and DME, the latter performed by a multidisciplinary team of healthcare professionals. Grading of DR and DME was performed manually and by the EyeArt (Eyenuk) software, version 2.1.0, based on the International Clinical Diabetic Retinopathy (ICDR) disease severity scale. Agreement was measured by Quadratic Weighted Kappa (QWK) and Cohen's Kappa (κ). Sensitivity, specificity, and diagnostic test accuracy (Area Under the Curve (AUC)) were also calculated. Results: A total of 128 individuals (51 women, 77 men; 247 eyes) were included, with a median age of 52.5 years. Prevalence of any vs. referable DR (RDR) was 20.2% vs. 11.7%, while sensitivity was 94.0% vs. 89.7%, specificity was 72.6% vs. 83.0%, and AUC was 83.5% vs. 86.3%, respectively. DME was detected in only one eye by both methods. Conclusions: AI-based grading offered high sensitivity and acceptable specificity for detecting DR, showing moderate agreement with manual assessments. Such grading may serve as an effective screening tool to support clinical evaluation, while ongoing training of human graders remains essential to ensure high-quality reference standards for accurate diagnosis and for the development of AI algorithms.
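For readers unfamiliar with the statistics named in the abstract, the following is a minimal sketch (in Python with scikit-learn; not the authors' code) of how QWK, Cohen's kappa, sensitivity, specificity, and AUC are typically computed against a manual reference standard. The grade arrays and the ≥ moderate NPDR referable threshold are illustrative assumptions, not study data.

```python
# Minimal sketch of the agreement and validity metrics named in the
# abstract. All data below are hypothetical placeholders.
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix, roc_auc_score

# Hypothetical per-eye ICDR grades (0 = no DR ... 4 = proliferative DR)
manual = np.array([0, 0, 1, 2, 0, 3, 0, 1, 2, 0])  # manual consensus grading
ai     = np.array([0, 1, 2, 2, 0, 1, 0, 1, 3, 0])  # automated AI grading

# Reliability: Cohen's kappa and Quadratic Weighted Kappa (QWK)
kappa = cohen_kappa_score(manual, ai)
qwk = cohen_kappa_score(manual, ai, weights="quadratic")

# Validity for referable DR (RDR), here assumed to be ICDR grade >= 2
ref_manual = (manual >= 2).astype(int)  # reference standard
ref_ai = (ai >= 2).astype(int)          # index test

tn, fp, fn, tp = confusion_matrix(ref_manual, ref_ai).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
# AUC from hard binary calls reduces to (sensitivity + specificity) / 2;
# with continuous risk scores, roc_auc_score would use them directly.
auc = roc_auc_score(ref_manual, ref_ai)

print(f"kappa={kappa:.2f}  QWK={qwk:.2f}  "
      f"sens={sensitivity:.1%}  spec={specificity:.1%}  AUC={auc:.1%}")
```

QWK penalizes disagreements by the square of their distance on the ordinal ICDR scale, so a one-step grading discrepancy costs far less than a miss across several severity levels, which is why it is preferred over unweighted kappa for ordinal DR grades.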
| Published |
Basel : MDPI |
| Type |
Journal article |
| Language |
English |
| Publication date |
2025 |
| CC license |
|