Objectives: This study addresses the need to evaluate texts created by generative artificial intelligence (GenAI). As GenAI becomes increasingly integrated into educational content creation (Ullmann et al., 2024), the quality of its output must be evaluated. GenAI texts are promising because they can easily adjust text difficulty to match readers' needs, supporting learners who often struggle with expository texts, including multilingual learners (Sudin & Swanto, 2024) and students with disabilities (Bakken et al., 1997). To examine the quality of GenAI texts, we produced and evaluated elementary science texts from four leading GenAI models across three critical dimensions: adherence to grade-appropriate literacy standards, vocabulary and concept overlap, and visual content effectiveness.
Theoretical Framework: Expository texts pose a challenge for elementary students (Adams et al., 2024) because they often impose an increased cognitive load through novel ideas, complex vocabulary, and unfamiliar text structure (Sweller, 2020). Castro-Alonso et al. (2021) have shown that reducing this load requires visualizations that complement the text, cohesive content, vocabulary overlap, and clear text structure. We consider these expository reading challenges through multimodal learning theory, which recognizes the importance of integrating text and visual elements for effective science education. We also consider the need for high-quality content aligned with the Next Generation Science Standards (NGSS), which emphasize concept explanation and appropriate scaffolding to build scientific literacy.
Methods: We used controlled prompts to generate texts on three key science topics identified in the NGSS (Sound Waves for Grade 1, Tornadoes for Grade 3, the Solar System for Grade 5), with specific parameters including text length, Lexile level, and NGSS alignment requirements.
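As an illustration, a minimal sketch of how such a controlled prompt might be assembled appears below. The topic-grade pairings follow the abstract, but the prompt wording, word-count target, and Lexile bands are hypothetical assumptions, not the study's actual parameters.

```python
# Minimal sketch of the controlled-prompt setup described in Methods.
# Topic/grade pairings follow the study; the prompt wording, word target,
# and Lexile bands below are illustrative assumptions only.

TOPICS = {
    1: ("Sound Waves", "BR-530L"),     # hypothetical Lexile band for Grade 1
    3: ("Tornadoes", "420L-820L"),     # hypothetical band for Grade 3
    5: ("Solar System", "740L-1010L"), # hypothetical band for Grade 5
}

def build_prompt(grade: int, word_target: int = 300, with_visuals: bool = False) -> str:
    """Assemble one controlled generation prompt for a given grade."""
    topic, lexile_band = TOPICS[grade]
    visual_clause = " Suggest accompanying illustrations with captions." if with_visuals else ""
    return (
        f"Write an expository science text about {topic} for Grade {grade}. "
        f"Target roughly {word_target} words and a Lexile level within {lexile_band}. "
        f"Align the content with the relevant NGSS performance expectations."
        + visual_clause
    )

if __name__ == "__main__":
    print(build_prompt(3, with_visuals=True))
```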
Data Sources: Primary data consisted of 24 science texts generated by four GenAI models (Claude 3.7, ChatGPT 4o, Gemini 2.5 Pro Experimental, DeepSeek V3 R1), encompassing both text-only and text-with-visuals formats across three grade levels. We used TextEvaluator (Sheehan et al., 2014) and Coh-Metrix (Graesser et al., 2004, 2011) to measure linguistic complexity, academic vocabulary usage, lexical cohesion, topic cohesion, and syntactic complexity. Visual quality was assessed against criteria spanning multiple dimensions, including clarity, motivational value, and text-visual integration (Devetak & Vogrinc, 2013).
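Because TextEvaluator and Coh-Metrix are external tools, their scores would typically be exported and aggregated offline. The following is a hedged sketch of one such aggregation step; the flat-CSV layout and the column names (model, grade, academic_vocabulary) are assumptions made for illustration, not the tools' actual export format.

```python
# Hypothetical aggregation of exported complexity scores by model and grade.
# The file layout and column names are assumed; neither TextEvaluator nor
# Coh-Metrix is guaranteed to export data in this exact format.
import csv
from collections import defaultdict
from statistics import mean

def summarize_scores(path: str) -> dict[tuple[str, str], float]:
    """Average the academic-vocabulary score for each (model, grade) pair."""
    scores: dict[tuple[str, str], list[float]] = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            scores[(row["model"], row["grade"])].append(
                float(row["academic_vocabulary"])
            )
    return {key: mean(vals) for key, vals in scores.items()}

# Usage (assuming an export named scores.csv):
# print(summarize_scores("scores.csv"))
```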
Results: All models consistently produced texts longer than the specified targets, with Gemini exceeding targets by 46-67%. No model produced appropriate first-grade texts. Academic vocabulary scores were substantially higher than grade norms for all models except Claude. DeepSeek demonstrated the most consistent alignment with expected complexity ranges across grade levels, while ChatGPT struggled particularly with first-grade complexity. For visual quality, Gemini produced the most images, though no model demonstrated consistently high visual quality across all grade levels.
Scientific Importance: This study establishes a multidimensional benchmarking framework specifically designed to evaluate GenAI-generated elementary science texts. The results highlight the need for benchmarks that support the responsible integration of GenAI tools in creating expository texts. The framework provides a foundation for future research examining how GenAI text generation can be improved and informs policy decisions regarding GenAI adoption in education.