Performance Evaluation of Large Language Models With Retrieval-Augmented Generation in Cardiology Specialist Examinations in Japan

Hiromasa Hayama; Tu Hao Tran; Jin Kirigaya; Yosuke Katayama; Tomoko Negishi; Koya Ozawa; Kazuaki Negishi

doi:10.1253/circrep.CR-25-0094

This article has now been updated. Please use the final version.

Keywords: Cardiology examination, Large language model, Medical education, Retrieval augmented generation

JOURNAL OPEN ACCESS FULL-TEXT HTML Advance online publication

Article ID: CR-25-0094

The final version of this article is now available: Vol. 7 (2025), No. 8 pp. 692-694

Abstract

Background: Large language models (LLMs) have shown potential in medical education, but their application to cardiology specialist examinations remains underexplored. We compared the performance of a retrieval-augmented generation LLM (RAG-LLM), ‘CardioCanon’, with that of general-purpose LLMs.

Methods and Results: A total of 96 publicly available, text-based, open-source multiple-choice questions from the Japanese Cardiology Specialist Examination (1997–2022) were used. CardioCanon showed option-level accuracy similar to that of ChatGPT-4o and Gemini 2.0 Flash (81.0%, 76.0%, and 77.2%, respectively), but higher case-based accuracy than ChatGPT-4o (57.3% vs. 29.2%, P<0.001).
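
As a rough plausibility check on the case-based comparison, the reported percentages can be converted back to whole-question counts out of 96 and passed through a simple significance test. The sketch below is illustrative only: the abstract does not state which test the authors used, so the unpaired pooled two-proportion z-test here is an assumption, and the helper names (`implied_count`, `two_proportion_z_test`) are hypothetical. Because both models answered the same questions, a paired test such as McNemar's would be the more natural choice, but it needs per-question results that the abstract does not provide.

```python
import math

N_QUESTIONS = 96  # number of questions stated in the abstract


def implied_count(percent, n=N_QUESTIONS):
    """Round a reported percentage back to the nearest whole number of questions."""
    return round(percent / 100 * n)


def two_proportion_z_test(x1, x2, n1, n2):
    """Unpaired pooled two-proportion z-test (an assumed test, not the authors').

    Returns the z statistic and the two-sided p-value.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of a standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Case-based accuracy reported in the abstract: 57.3% vs. 29.2%
cardiocanon_correct = implied_count(57.3)  # 55 of 96
chatgpt_correct = implied_count(29.2)      # 28 of 96

z, p_value = two_proportion_z_test(
    cardiocanon_correct, chatgpt_correct, N_QUESTIONS, N_QUESTIONS
)
print(f"CardioCanon: {cardiocanon_correct}/{N_QUESTIONS}")
print(f"ChatGPT:     {chatgpt_correct}/{N_QUESTIONS}")
print(f"z = {z:.2f}, two-sided p = {p_value:.2g}")
```

Under these assumptions the implied counts are 55 and 28 correct cases, and the resulting p-value falls below 0.001, which is in line with the reported P<0.001.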

Conclusions: RAG techniques can enhance AI-assisted examination performance by improving case-level reasoning and decision-making.
