BMC Med Inform Decis Mak. 2025 Sep 1;25(1):325. doi: 10.1186/s12911-025-03035-2.
ABSTRACT
BACKGROUND: The aim of this review was to synthesize the results of studies on the readability of texts produced by ChatGPT and Bard in medical communication.
METHODS: A systematic literature search was conducted in PubMed, Ovid/MEDLINE, CINAHL, Web of Science, Scopus and Google Scholar to identify relevant publications (inclusion criteria: original research articles, English language, medical topic, ChatGPT-3.5/-4.0, Bard/Gemini, Flesch Reading Ease score (FRE), Flesch-Kincaid Grade Level (FKGL)). Study quality was assessed using a modified Downs and Black checklist (max. 8 points) adapted for studies on large language models. Analyses covered text simplification and/or text generation with ChatGPT-3.5/-4.0 versus Bard/Gemini. A meta-analysis was conducted if an outcome parameter was reported in ≥ 3 studies. In addition, subgroup analyses among different chatbot versions were performed, and publication bias was analyzed.
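The two readability metrics used as outcome parameters follow the standard published formulas (FRE = 206.835 − 1.015 × words/sentence − 84.6 × syllables/word; FKGL = 0.39 × words/sentence + 11.8 × syllables/word − 15.59). A minimal sketch of how they are computed is shown below; the syllable counter is a crude vowel-group heuristic for illustration only, not the dictionary-based counting that dedicated readability tools use:

```python
import re

def count_syllables(word):
    # Crude heuristic: count vowel groups; drop a trailing silent "e".
    # Real tools use pronunciation dictionaries (e.g. CMUdict).
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text):
    """Return (FRE, FKGL) for a plain-text passage."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    fre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fre, fkgl
```

Higher FRE indicates easier text, while higher FKGL indicates a higher required school grade level, which is why the two metrics move in opposite directions throughout the results below.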
RESULTS: Overall, 59 studies with 2342 items were analyzed. Study quality was limited, with a mean of 6 points for FRE and 7 points for FKGL. The meta-analysis of text simplification for FRE between ChatGPT-3.5/-4.0 and Bard/Gemini was not significant (mean difference (MD): 5.03; 95% confidence interval (CI): -20.05, 30.11; p = 0.48). The FKGL of texts simplified by ChatGPT-3.5/-4.0 versus Bard/Gemini was borderline significant (MD: -1.59; CI: -3.15, -0.04; p = 0.05), and the subgroup analysis between ChatGPT-4.0 and Bard was not significant (MD: -1.68; CI: -3.53, 0.17; p = 0.07). For text generation, the MDs for FRE and FKGL between ChatGPT-3.5/-4.0- and Bard/Gemini-generated texts were significant (MD: -10.36; CI: -13.08, -7.64; p < 0.01 and MD: 1.62; CI: 1.09, 2.15; p < 0.01, respectively). Subgroup analysis of FRE was significant for ChatGPT-3.5 vs. Bard (MD: -16.07; CI: -24.90, -7.25; p < 0.01), ChatGPT-3.5 vs. Gemini (MD: -4.51; CI: -8.73, -0.29; p = 0.04), ChatGPT-4.0 vs. Bard (MD: -12.01; CI: -16.22, -7.81; p < 0.01) and ChatGPT-4.0 vs. Gemini (MD: -7.91; CI: -11.68, -4.15; p < 0.01). Subgroup analysis of FKGL was significant for ChatGPT-3.5 vs. Bard (MD: 2.85; CI: 1.98, 3.73; p < 0.01), ChatGPT-3.5 vs. Gemini (MD: 1.21; CI: 0.50, 1.93; p < 0.01) and ChatGPT-4.0 vs. Gemini (MD: 1.95; CI: 1.05, 2.86; p < 0.01), but not for ChatGPT-4.0 vs. Bard (MD: 0.64; CI: -0.46, 1.74; p = 0.24). Egger's test was significant in text generation for FRE and FKGL (p < 0.01 each), in the subgroups ChatGPT-4.0 vs. Bard and ChatGPT-4.0 vs. Gemini for FRE (p < 0.01 and p = 0.02), and in the subgroups ChatGPT-3.5 vs. Bard and ChatGPT-4.0 vs. Gemini for FKGL (p < 0.01 each).
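The pooled mean differences and 95% confidence intervals reported above are typically obtained by inverse-variance weighting of per-study effects. The sketch below illustrates fixed-effect pooling followed by a DerSimonian-Laird random-effects adjustment, a common choice for meta-analyses of this kind; the input values are hypothetical, and the abstract does not state which pooling model the authors used:

```python
import math

def pool_md(mds, ses):
    """Pool per-study mean differences (mds) with standard errors (ses).

    Returns the DerSimonian-Laird random-effects pooled MD and its 95% CI.
    """
    # Fixed-effect inverse-variance weights and pooled estimate.
    w = [1.0 / se**2 for se in ses]
    sw = sum(w)
    md_fixed = sum(wi * m for wi, m in zip(w, mds)) / sw
    # Cochran's Q heterogeneity statistic and between-study variance tau^2.
    q = sum(wi * (m - md_fixed) ** 2 for wi, m in zip(w, mds))
    df = len(mds) - 1
    c = sw - sum(wi**2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)
    # Random-effects weights incorporate tau^2.
    w_re = [1.0 / (se**2 + tau2) for se in ses]
    md_re = sum(wi * m for wi, m in zip(w_re, mds)) / sum(w_re)
    se_re = math.sqrt(1.0 / sum(w_re))
    return md_re, (md_re - 1.96 * se_re, md_re + 1.96 * se_re)
```

A pooled MD whose 95% CI excludes zero corresponds to the "significant" comparisons above (e.g. FRE in text generation), while a CI spanning zero corresponds to the non-significant ones (e.g. FRE in text simplification).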
CONCLUSION: The readability of spontaneously generated texts by Bard/Gemini was slightly superior to that of ChatGPT-3.5/-4.0, whereas the readability of texts simplified by ChatGPT-3.5/-4.0 tended to be better than that of Bard. Results are limited by study quality and publication bias. Standardized reporting could improve study quality and chatbot development.
PMID:40890707 | DOI:10.1186/s12911-025-03035-2