Comparative Evaluation of Proprietary and Open-Source Large Language Models for Systematic Multi-source Information Extraction in Interventional Oncology

Cardiovasc Intervent Radiol. 2025 Dec 7. doi: 10.1007/s00270-025-04287-1. Online ahead of print.

ABSTRACT

PURPOSE: To compare proprietary (GPT-4o, Gemini 1.5 Pro) and open-source (Llama 3.1 70B, Llama 3.1 405B) large language models (LLMs) for extracting clinically relevant variables from transarterial chemoembolization (TACE) reports in patients with hepatocellular carcinoma (HCC).

METHODS: This retrospective analysis included 556 anonymized longitudinal TACE-related reports (radiology, interventional procedure, and clinical follow-up) from 50 patients with HCC treated between 2012 and 2024 at a single tertiary center. Models extracted predefined binary variables (e.g., modified Response Evaluation Criteria in Solid Tumors [mRECIST] tumor response, alpha-fetoprotein [AFP] dynamics, Barcelona Clinic Liver Cancer [BCLC] stage) and ordinal variables (e.g., liver segment involvement, vascular invasion, follow-up assessment) using a standardized system prompt and output template. Model performance was assessed by accuracy, ordinal scores, and longitudinal error rates using mixed-effects regression with patient-level random intercepts.
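As a rough illustration of how accuracy on a binary variable might be scored when each patient contributes several reports, the sketch below compares model outputs with reference annotations and averages within each patient before averaging across patients. This is not the authors' analysis: the study used mixed-effects regression with patient-level random intercepts, and the per-patient averaging here is only a simple stand-in that keeps patients with many reports from dominating the result. The record layout, the variable name `bclc_stage_b`, and all values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records, not study data:
# (patient_id, report_index, variable, model_output, reference_label)
records = [
    ("p01", 0, "bclc_stage_b", 1, 1),
    ("p01", 1, "bclc_stage_b", 1, 1),
    ("p01", 2, "bclc_stage_b", 0, 1),
    ("p02", 0, "bclc_stage_b", 0, 0),
    ("p02", 1, "bclc_stage_b", 1, 0),
    ("p02", 2, "bclc_stage_b", 0, 0),
]


def per_patient_accuracy(rows):
    """Return {patient_id: fraction of reports where output equals reference}."""
    hits = defaultdict(list)
    for patient, _report_index, _variable, predicted, reference in rows:
        hits[patient].append(1 if predicted == reference else 0)
    return {patient: mean(values) for patient, values in hits.items()}


def patient_averaged_accuracy(rows):
    """Average the per-patient accuracies so each patient counts equally.

    Assumption: a crude substitute for a random-intercept model, used here
    only to show why reports from one patient should not be pooled naively.
    """
    return mean(list(per_patient_accuracy(rows).values()))


if __name__ == "__main__":
    print(per_patient_accuracy(records))
    print(patient_averaged_accuracy(records))
```

With real data, a mixed-effects model fitted in a statistics package would replace `patient_averaged_accuracy`, since it also yields the significance tests reported in the results.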

RESULTS: Proprietary models outperformed open-source models. GPT-4o and Gemini 1.5 Pro achieved the highest mean accuracies for binary variables (0.87 ± 0.21 and 0.85 ± 0.16, respectively) and the highest mean scores for ordinal variables (4.15/5 and 4.10/5, respectively), significantly exceeding both Llama models (p < 0.05). GPT-4o showed the lowest longitudinal error rate for binary variables (0.01 vs 0.09-0.21 for the other models), indicating greater robustness over time. All models performed poorly in vascular invasion detection and follow-up assessment.

CONCLUSION: Proprietary LLMs can accurately extract most key TACE-related variables from routine clinical reports and may support decision-making in interventional oncology; however, all models showed poor performance in vascular invasion detection and follow-up assessment, so expert human oversight remains essential.

PMID:41354880 | DOI:10.1007/s00270-025-04287-1

John Joseph