J Dent Res. 2025 Nov 3:220345251382452. doi: 10.1177/00220345251382452. Online ahead of print.
ABSTRACT
Accurate clinical records are fundamental to dental practice. Automatic speech recognition (ASR) can convert spoken clinical language into written text within the electronic health record; however, the accuracy of ASR for the natural language of clinical dentistry remains uncertain. The aim of this study was to investigate the transcriptional accuracy of ASR systems using orthodontic clinical records as the experimental model. Specifically, we used 4 commercial ASR systems (Heidi Health, DigitalTCO, Dragon Medical One, Dragon Professional Anywhere), 5 application programming interfaces (Amazon, Google, Speechmatics, Whisper, GPT4oTranscribe), and a 2-stage pipeline coupling GPT4oTranscribe with the GPT4o large language model (LLM) for generative error correction (GPT4oTranscribeCorrected). Orthodontic diagnostic and treatment planning summaries (n = 200; 10 subject domains; 43,408 words; 6 h of audio) were narrated and recorded for analysis. The primary outcome was domain word error rate (DWER), which quantifies transcription errors in clinical terminology identified against the Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) database. Secondary outcomes included nondomain WER (N-DWER), lexical accuracy (Recall-Oriented Understudy for Gisting Evaluation [ROUGE] score), semantic similarity (Bidirectional Encoder Representations from Transformers [BERT] and Bidirectional and Auto-Regressive Transformers [BART] scores), hallucinations (transcribed text absent from the spoken input), and qualitative error analysis. GPT4oTranscribeCorrected was the most transcriptionally accurate (DWER = 3.5%; WER = 3.7%), with DWER decreasing by 54.9% versus GPT4oTranscribe alone. Heidi Health was the highest-performing commercial system (DWER = 6.2%; WER = 5.4%), and Dragon Professional Anywhere was the worst (WER = 33.9%). All systems except GPT4oTranscribeCorrected were less accurate with technical vocabulary (DWER > N-DWER; P < 0.001). Significant differences were seen across systems for ROUGE, BERT, and BART scores (P < 0.001); in post hoc pairwise comparisons, GPT4oTranscribeCorrected performed best and Dragon Professional Anywhere consistently worst for lexical and semantic errors. Hallucinations were absent except for Whisper (n = 57) and DigitalTCO (n = 1). Across systems, background noise increased DWER and WER (P < 0.001). Importantly, clinically significant errors occurred with all systems, ranging from 2% (GPT4oTranscribeCorrected, clean audio) to 66% (Dragon Medical One, background noise). Variation in narrator accent had no effect in clean conditions (P = 0.65) and a small effect with background noise (P = 0.001). The best-performing ASR systems deliver single-digit transcription error rates, particularly when coupled with LLM-based correction, but clinically significant errors persist. Verification of clinical records is therefore essential when using current ASR systems.
PMID:41178647 | DOI:10.1177/00220345251382452
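As a rough illustration of the study's primary outcome, the Python sketch below computes word error rate (WER) as word-level Levenshtein distance normalized by reference length, together with one plausible domain-restricted variant (DWER) scored over clinical terms only. The tokenization, the toy SNOMED-CT term subset, and the filter-then-score approach to DWER are assumptions for illustration; the abstract does not specify the study's exact scoring procedure.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        return 0.0  # simplification: guard against an empty reference
    # Dynamic-programming edit-distance table over word tokens.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

def domain_word_error_rate(reference: str, hypothesis: str,
                           domain_terms: set) -> float:
    """One plausible DWER: WER computed over domain vocabulary only."""
    ref = " ".join(w for w in reference.lower().split() if w in domain_terms)
    hyp = " ".join(w for w in hypothesis.lower().split() if w in domain_terms)
    return word_error_rate(ref, hyp)

# Hypothetical SNOMED-CT subset and a toy reference/hypothesis pair.
snomed_terms = {"malocclusion", "overjet", "crossbite"}
ref = "Class II malocclusion with increased overjet"
hyp = "Class two malocclusion with increased overjet"
print(word_error_rate(ref, hyp))                       # 0.167 (1 error / 6 words)
print(domain_word_error_rate(ref, hyp, snomed_terms))  # 0.0 (domain terms intact)

The toy pair shows why DWER and WER can diverge: "II" misheard as "two" raises WER but leaves the SNOMED-CT vocabulary untouched, so DWER is unaffected. A 2-stage pipeline along the lines of GPT4oTranscribeCorrected could be sketched as below with the OpenAI Python SDK; the audio file name, correction prompt, and decoding settings are assumptions, since the abstract states only that GPT4oTranscribe output was passed to the GPT4o LLM for generative error correction.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Stage 1: ASR transcription of the narrated clinical summary.
with open("orthodontic_summary.wav", "rb") as audio:  # hypothetical file
    draft = client.audio.transcriptions.create(
        model="gpt-4o-transcribe", file=audio
    ).text

# Stage 2: generative error correction of the draft transcript.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": ("Correct speech-recognition errors in this orthodontic "
                     "clinical note. Fix misrecognized clinical terms only; "
                     "do not add or remove content.")},
        {"role": "user", "content": draft},
    ],
)
corrected = response.choices[0].message.content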