Background
Medical Maintenance of Certification (MOC) programs usually deliver brief, generic critiques that physicians skim in about thirty seconds and rarely revisit, offering little educational value. Because these one-size-fits-all comments neither probe individual reasoning nor address specific knowledge gaps, they seldom trigger the reflection needed for durable learning (Esterhazy & Damşa, 2019; O’Donovan et al., 2021). Empirical studies in medical education show that feedback aligned with a learner’s actual performance captures attention, strengthens metacognitive monitoring, and channels study time toward true deficiencies (Abraham & Singaram, 2021; Iraj et al., 2020; Karaoglan Yilmaz & Yilmaz, 2022; Tan et al., 2016). Until recently, such personalization was too resource-intensive to scale. Large language models (LLMs) now change that equation: by tracking each physician’s responses over time, an LLM can pose targeted follow-up questions, correct misconceptions, and provide concise, evidence-based explanations in real time, turning static critiques into dynamic coaching dialogues that encourage repeated engagement (Arif et al., 2023; Breeding et al., 2024).
Study Design
This study has received IRB approval from the AAFP. The pilot will enroll 60 board-certified family physicians representing diverse ages, genders, regions, and practice settings. All participants will answer twelve online multiple-choice items per quarter for six consecutive quarters. After a shared twelve-item baseline block (Quarter 0), they are randomized, stratified by demographics, into a control group and an experimental group. Both groups face identical clinical content; the control arm receives concise, noninteractive explanations, whereas the experimental arm receives AI-generated clone items and interacts with a fine-tuned GPT-4o model that, for each incorrect or high-confidence response, prompts reflection, conducts up to three short exchanges, and then offers a tailored critique anchored to current evidence. Spaced repetition is embedded within both groups: a single clone-item exposure in Quarter 1 or 2, or a double exposure in Quarters 1 + 2, 1 + 3, or 2 + 4, while Quarter 5 contains no clone items in order to reduce recognition before the final assessment in Quarter 6 (Table 3.1).
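To make the quarter-by-quarter design concrete, the sketch below encodes the exposure schedule implied by the text. It is a hypothetical reconstruction, not a reproduction of Table 3.1: the assignment of specific group numbers to specific quarters is an assumption, guided only by the hypotheses' convention that Groups 1-2 are single-repetition arms and Groups 3-5 are two-repetition arms.

```python
# Hypothetical reconstruction of the clone-item exposure schedule implied by the
# study description (Table 3.1 itself is not reproduced here). Which numbered group
# receives which quarters is an assumption; the abstract only states that
# single-exposure arms see a clone in Quarter 1 or 2 and double-exposure arms in
# Quarters 1+2, 1+3, or 2+4.

CLONE_EXPOSURE_SCHEDULE = {
    "Group 1": {1},     # single exposure
    "Group 2": {2},     # single exposure
    "Group 3": {1, 2},  # double exposure
    "Group 4": {1, 3},  # double exposure
    "Group 5": {2, 4},  # double exposure
}

BASELINE_QUARTER = 0    # shared twelve-item baseline block
WASHOUT_QUARTER = 5     # no clone items, to reduce recognition
FINAL_ASSESSMENT = 6    # outcome quarter

def sees_clone(group: str, quarter: int) -> bool:
    """Return True if the given group is scheduled to see a clone item in that quarter."""
    return quarter in CLONE_EXPOSURE_SCHEDULE.get(group, set())
```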
Hypotheses and Statistical Analysis
The hypotheses and statistical analyses are described below:
Hypothesis 1: The experimental groups (Groups 1-5) will demonstrate higher correct percentages, confidence, and calibration (alignment between response correctness and confidence level) than the control group in Q6. Independent-samples t-tests on correct percentage, confidence rating, and calibration will be used to test this hypothesis.
Hypothesis 2: The two-repetition groups (Groups 3-5) will demonstrate higher correct percentages, confidence, and calibration than the one-repetition groups (Groups 1-2) in Q6.
Hypothesis 3: There will be no significant differences in correct percentage, confidence, or calibration among the two-repetition groups (Groups 3-5). ANOVA will be used to test this hypothesis.
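As a minimal sketch of these planned analyses, assuming per-participant Q6 summary scores are available as NumPy arrays, the snippet below pairs Hypothesis 1 with an independent-samples t-test and Hypothesis 3 with a one-way ANOVA. The calibration score used here (one minus the mean absolute gap between confidence rescaled to 0-1 and item correctness) is an illustrative assumption, since the abstract does not specify how calibration will be computed.

```python
# Sketch of the planned Q6 comparisons, assuming per-participant summary scores
# arrive as NumPy arrays. The calibration measure is an illustrative assumption.
import numpy as np
from scipy import stats

def calibration_score(correct: np.ndarray, confidence: np.ndarray) -> float:
    """Per-participant calibration on Q6 items: correctness coded 0/1, confidence
    rescaled to 0-1; higher values mean confidence tracks correctness more closely."""
    return 1.0 - float(np.mean(np.abs(confidence - correct)))

def test_hypothesis_1(experimental: np.ndarray, control: np.ndarray):
    """Independent-samples t-test comparing the pooled experimental arms with the
    control arm on one Q6 outcome (correct percentage, confidence, or calibration)."""
    return stats.ttest_ind(experimental, control)

def test_hypothesis_3(group3: np.ndarray, group4: np.ndarray, group5: np.ndarray):
    """One-way ANOVA across the three two-repetition arms on one Q6 outcome."""
    return stats.f_oneway(group3, group4, group5)
```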
Implications
Verification of these hypotheses would demonstrate a cost-efficient pathway for medical certification boards to pair AI-generated clone questions with conversational LLM coaching, expanding item pools while personalizing learning at scale. Documented gains in diagnostic accuracy and confidence would justify embedding adaptive dashboards into MOC programs, sustaining physician engagement and sharpening metacognitive skills. By offering physicians targeted, evidence-linked feedback within routine assessment, certification boards could advance their missions of lifelong competence and ultimately improve patient care.