The evaluation of artificial intelligence in mammography-based breast cancer screening: Is breast-level analysis enough?

OBJECTIVES: To assess whether the diagnostic performance of a commercial artificial intelligence (AI) algorithm for mammography differs between breast-level and lesion-level interpretations and to compare performance to a large population of specialised human readers.
MATERIALS AND METHODS: We retrospectively analysed 1200 mammograms from the NHS breast cancer screening programme using a commercial AI algorithm and assessments from 1258 trained human readers from the Personal Performance in Mammographic Screening (PERFORMS) external quality assurance programme. For breasts containing pathologically confirmed malignancies, both a breast-level and a lesion-level analysis were performed. The lesion-level analysis considered the locations of the regions of interest marked by AI and by human readers, and recorded the highest score per lesion. For non-malignant breasts, a breast-level analysis recorded the highest score per breast. The area under the receiver operating characteristic curve (AUC) was calculated, along with sensitivity and specificity at the developer's recommended threshold for recall. The study was designed to detect a medium-sized effect (odds ratio 3.5 or 0.29) for sensitivity.
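The distinction between the two scoring levels can be illustrated with a minimal Python sketch. It is not the study's implementation: the `Mark` and `Lesion` types, the point-in-box matching rule, the 0.0 default for unmarked breasts, and the 0.5 threshold are all assumptions made for illustration, since the abstract does not specify how marks were matched to lesions or what the developer's threshold was.

```python
from dataclasses import dataclass


@dataclass
class Mark:
    """A marked region of interest (hypothetical representation)."""
    x: float
    y: float
    score: float  # suspicion score, assumed to lie in [0, 1]


@dataclass
class Lesion:
    """Ground-truth lesion as a bounding box (hypothetical representation)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def breast_level_score(marks):
    """Highest score among all marks on a breast (0.0 if nothing was marked)."""
    return max((m.score for m in marks), default=0.0)


def lesion_level_score(marks, lesion):
    """Highest score among marks that fall inside the lesion (0.0 if none do)."""
    hits = [
        m.score
        for m in marks
        if lesion.x_min <= m.x <= lesion.x_max and lesion.y_min <= m.y <= lesion.y_max
    ]
    return max(hits, default=0.0)


def auc(pos_scores, neg_scores):
    """AUC as the probability that a malignant case outscores a non-malignant one.

    Ties count as 0.5 (the Mann-Whitney formulation).
    """
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


def sensitivity(pos_scores, threshold):
    """Fraction of malignant cases scored at or above the recall threshold."""
    return sum(s >= threshold for s in pos_scores) / len(pos_scores)


def specificity(neg_scores, threshold):
    """Fraction of non-malignant cases scored below the recall threshold."""
    return sum(s < threshold for s in neg_scores) / len(neg_scores)


# A malignant breast where the highest-scoring mark is away from the lesion.
marks = [Mark(50, 50, 0.9), Mark(15, 15, 0.3)]
lesion = Lesion(10, 10, 20, 20)
THRESHOLD = 0.5  # illustrative only, not the developer's threshold

print(breast_level_score(marks))          # 0.9, recalled at breast level
print(lesion_level_score(marks, lesion))  # 0.3, missed at lesion level
```

In this toy case the breast counts as a true positive at breast level, because some mark exceeds the threshold, but as a miss at lesion level, because the only mark on the lesion scores below it. This is one way lesion-level sensitivity and AUC can fall below their breast-level counterparts.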
RESULTS: The test set contained 882 non-malignant (73%) and 318 malignant breasts (27%), with 328 cancer lesions. The AI AUC was 0.942 at breast level and 0.929 at lesion level (difference -0.013, p < 0.01). The mean human AUC was 0.878 at breast level and 0.851 at lesion level (difference -0.027, p < 0.01). According to the AUC, AI outperformed human readers at both the breast level and the lesion level (both p < 0.01).
CONCLUSION: AI's diagnostic performance significantly decreased at the lesion level, indicating reduced accuracy in localising malignancies. However, its overall performance exceeded that of human readers.
KEY POINTS: Question: AI often recalls mammography cases not recalled by humans; to understand why, we as humans must consider the regions of interest it has marked as cancerous. Findings: Evaluations of AI typically occur at the breast level, but performance decreases when AI is evaluated at the lesion level. This also occurs for humans. Clinical relevance: To improve human-AI collaboration, AI should be assessed at the lesion level; poor accuracy here may lead to automation bias and unnecessary patient procedures.

© 2025. The Author(s).
European radiology, 2025-06-27