Assessing the Accuracy, Completeness, and Reliability of Artificial Intelligence-Generated Responses in Dentistry: A Pilot Study Evaluating the ChatGPT Model
posted on 2025-06-24, 18:16, authored by Kelly F Molena, Ana P Macedo, Anum Ijaz, Fabrício K Carvalho, Maria Julia D Gallo, Francisco Wanderley Garcia de Paula E Silva, Andiara de Rossi, Luis A Mezzomo, Leda Mugayar, Alexandra M Queiroz
BACKGROUND: Artificial intelligence (AI) can be a tool for diagnosis and knowledge acquisition, particularly in dentistry, and its use in clinical decision-making is debated.
OBJECTIVE: This study aims to evaluate the accuracy, completeness, and reliability of the responses generated by Chat Generative Pre-trained Transformer (ChatGPT) 3.5 in dentistry using expert-formulated questions.
MATERIALS AND METHODS: Experts were invited to create three questions each, with answers and supporting references, within their specialized fields of activity. Likert scales were used to rate the level of agreement between the expert answers and the ChatGPT responses. Statistical analysis compared the descriptive and binary question groups in terms of accuracy and completeness. Questions with low accuracy underwent re-evaluation, and the subsequent responses were compared with the initial ones for improvement. The Wilcoxon test was used (α = 0.05).
RESULTS: Ten experts across six dental specialties generated 30 binary and descriptive dental questions with references. The accuracy score had a median of 5.50 and a mean of 4.17. For completeness, the median was 2.00 and the mean was 2.07. No difference was observed between descriptive and binary responses for accuracy or completeness. However, re-evaluated responses improved significantly in accuracy (median 5.50 vs. 6.00; mean 4.17 vs. 4.80; p = 0.042) and completeness (median 2.00 vs. 2.00; mean 2.07 vs. 2.30; p = 0.011). References were more often incorrect than correct, with no difference between descriptive and binary questions.
CONCLUSIONS: ChatGPT initially demonstrated good accuracy and completeness, which was further improved with machine learning (ML) over time. However, some inaccurate answers and references persisted. Human critical discernment remains essential for handling complex clinical cases and for advancing theoretical knowledge and evidence-based practice.
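The comparison of initial and re-evaluated scores can be illustrated with a minimal sketch. The abstract names only "the Wilcoxon test"; the sketch assumes the paired signed-rank variant, since the same questions were scored twice. The function name `wilcoxon_signed_rank` and the scores in `before` and `after` are hypothetical, made up for illustration, and are not the study's data.

```python
def wilcoxon_signed_rank(before, after):
    """Return (W_plus, W_minus, n) for paired scores.

    Pairs with no change are dropped; tied absolute differences
    share their average rank.
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    # Positions of the differences, ordered by absolute size.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Extend j over the run of equal absolute differences.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus, len(diffs)


# Hypothetical Likert scores for six questions (not the study's data).
before = [1, 2, 6, 6, 5, 3]  # first evaluation
after = [3, 2, 6, 6, 6, 2]   # re-evaluation of the same questions

w_plus, w_minus, n = wilcoxon_signed_rank(before, after)
print(w_plus, w_minus, n)
```

The test statistic is usually taken as the smaller of the two rank sums. A p-value would then come from the exact signed-rank distribution or from a statistics library such as SciPy's `scipy.stats.wilcoxon`, and would be compared against α = 0.05.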
Funding
Universal Call MCTI/CNPq | Funder: National Council for Scientific and Technological Development
Large Research Grant | Funder: International Team for Implantology
Clinical and laboratory studies in Oral Rehabilitation | Funder: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior | Grant ID: 433011
Risk Factors in the Prognosis of Complete Removable Dentures Retained by Implants | Funder: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior | Grant ID: 433005
Risk factors in the prognosis of removable total dentures retained by extra-short implants (4-mm) in severely resorbed mandibles - randomized clinical trial | Funder: National Council for Scientific and Technological Development
Straumann Research Grant | Funder: Straumann Dental Implant System
Citation
Molena, K. F., Macedo, A. P., Ijaz, A., Carvalho, F. K., Gallo, M. J. D., Wanderley Garcia de Paula E Silva, F., de Rossi, A., Mezzomo, L. A., Mugayar, L. R. F., & Queiroz, A. M. (2024). Assessing the Accuracy, Completeness, and Reliability of Artificial Intelligence-Generated Responses in Dentistry: A Pilot Study Evaluating the ChatGPT Model. Cureus, 16(7), e65658. https://doi.org/10.7759/cureus.65658