Development of a Synthetic Oncology Pathology Dataset for Large Language Model Evaluation in Medical Text Classification
Hackl WO, Neururer SB, Richter S, Taha H, Muehlboeck H, Hickmann C, Gscheidlinger P, Danler M, Schweitzer M, Ueberegger M, Pfeifer B
BACKGROUND: Large Language Models (LLMs) are promising tools for classifying oncology pathology reports and could improve the efficiency, accuracy, and automation of this task. However, legal and ethical constraints restrict the use of real patient data, so privacy-compliant alternatives are needed.
OBJECTIVES: This study aimed to develop a synthetic oncology pathology dataset to serve as a benchmark for LLM evaluation, enabling reproducible and privacy-preserving AI research.
METHODS: A total of 227 synthetic pathology reports were generated with three tools (Microsoft Copilot, ChatGPT Plus, and Perplexity Pro) to promote structural and linguistic diversity. The dataset covered prostate (n=75), lung (n=78), and breast (n=74) cancer, with malignant (n=113) and benign (n=114) findings almost evenly balanced. Three independent cancer registrars reviewed and classified the reports using a consensus-based validation process.
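The abstract does not specify how the consensus among the three registrars was reached. As a minimal sketch, assuming a report receives a final label only when all three registrars agree and is otherwise returned for discussion, the rule could look like this (the function name `consensus_label` and the unanimity rule are illustrative assumptions, not the authors' documented procedure):

```python
def consensus_label(votes):
    """Return the shared label if all three registrars agree, else None.

    Assumption: unanimity is required; a None result means the report
    would go back to the registrars for discussion.
    """
    if len(votes) != 3:
        raise ValueError("expected exactly three registrar votes")
    # A set with one element means every vote is identical.
    return votes[0] if len(set(votes)) == 1 else None


# Hypothetical votes for two reports.
agreed = consensus_label(["malignant", "malignant", "malignant"])
disputed = consensus_label(["malignant", "benign", "malignant"])
```

Under this assumed rule, `agreed` holds the shared label and `disputed` is `None`, which marks the report for further review.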
RESULTS & CONCLUSION: The dataset provides a structured, clinically relevant benchmark for evaluating LLM performance in pathology text classification. Because it contains no real patient data, it allows AI models to be assessed without compromising data privacy and supports scalable, ethical AI-driven oncology documentation.
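To illustrate how such a benchmark might be used, the following minimal sketch scores a model's malignant/benign predictions against the registrar-validated labels. The metrics shown (accuracy, sensitivity, specificity) and the function name `evaluate` are assumptions chosen for illustration; the abstract does not state which metrics the authors intend.

```python
def evaluate(gold, predicted, positive="malignant"):
    """Compare predicted labels with gold labels for a binary task.

    Returns accuracy, sensitivity, and specificity; a metric is None
    when its denominator is zero.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Hypothetical example with four reports.
gold = ["malignant", "malignant", "benign", "benign"]
predicted = ["malignant", "benign", "benign", "benign"]
scores = evaluate(gold, predicted)
```

In this toy example, one malignant report is missed, so accuracy is 0.75, sensitivity is 0.5, and specificity is 1.0.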
Studies in health technology and informatics, 2025-04-26