Utilizing Item-Level Assessment Data at Scale: Insights from the Let’s Mark Platform in Rwanda

Tue, March 31, 4:30 to 5:45pm, Hilton, Floor: Ballroom Level - Tower 2, Franciscan C

Proposal

Motivation
Accurate learning data is essential for both policy design and classroom instruction. While some systems collect information on foundational literacy and numeracy (FLN) through tools like EGRA/EGMA or impact evaluations, there is little or no standardized data on ongoing performance across grades or at a census level. Even where assessments exist, the data is often incomplete. In Rwanda, for example, Term 1 exams are designed at the school level and lack standardization. Term 3 exams are standardized but reported only as aggregate scores. Such measures show broad trends but hide irregularities and offer little diagnostic value.
Item-level data, which records each pupil’s response to every question, provides far richer insights. It can highlight topic-level strengths and weaknesses, reveal irregularities, and flag test design issues. Yet historically, collecting such data was slow and costly: recording hundreds of thousands of responses manually was simply impractical.
Let’s Mark solves this challenge. Teachers scan pupils’ answer sheets with the app, which marks each response, flags errors, and uploads results to a central platform. The process is fast, low-cost, and scalable. This makes item-level data practical for both classroom instruction and system-wide monitoring.
This study draws on evidence from Rwanda, where more than 210,000 pupils sat a 50-item numeracy test in June 2024. The results show both the feasibility of large-scale item-level collection and the new kinds of analysis it makes possible.

Research Questions
What new analyses and decisions become possible once item-level data is available?
In what ways can item-level data enhance the integrity of assessments, support classroom instruction, and strengthen assessment design?

Methods and Data
The case study draws on a large-scale numeracy assessment conducted in Rwanda in June 2024. The assessment was administered to 210,195 pupils in Primary 4 through Primary 6 (P4–P6) across 761 schools, covering all 30 districts and five provinces. The test included 50 items, with difficulty intentionally increasing every 10 questions to measure a wide range of foundational numeracy skills. In addition, more than 323,000 pupils in lower grades took shortened versions of the test.
Each pupil’s responses were collected through the Let’s Mark platform, ensuring that item-level data was recorded consistently and without the burden of manual entry. Items were also tagged by content domain, including number recognition, addition, ratios, probability, and word problems. This tagging allowed both psychometric analyses of test design and instructional analyses of content mastery.

Preliminary Findings
Ensuring Data Integrity
Before assessment data can be used to inform instruction or policy, stakeholders must trust that it is valid. Aggregate scores can flag simple anomalies, such as a class where every pupil scores 100%, but these checks are limited. Item-level data allows far more robust integrity analysis.
For example, in a Primary 6 class at one school, 39 out of 50 pupils submitted identical answers on the final ten questions, including the same three incorrect responses. By contrast, a class at a different school with a nearly identical average score showed no such pattern.
Similarly, the design of the test, with difficulty increasing in successive blocks, makes it improbable that pupils would perform better on harder items than on easier ones. Yet in one Primary 4 class, 72% of pupils showed exactly this suspicious pattern, compared with just 5% in another class with the same average score.
Such comparisons reveal irregularities that are invisible in averages. Indicators such as duplicate entries, excessive missing data, and improbably high scores can be combined into an overall integrity rating for each class, allowing ministries to flag questionable data and decide whether to recollect or exclude it.
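As a minimal sketch of how such checks could be automated, the Python fragment below flags two of the patterns described above: duplicated answer strings on the final block, and pupils outscoring themselves on harder items. The data layout, column names, and thresholds are illustrative assumptions, not the platform’s actual implementation.

    import pandas as pd

    def integrity_flags(responses: pd.DataFrame, scores: pd.DataFrame) -> pd.DataFrame:
        # Assumed layout: one row per pupil, sharing an index across both
        # frames; 'responses' holds the chosen option in columns q1..q50,
        # 'scores' holds 0/1 correctness, plus a 'class_id' column.
        rows = []
        for class_id, grp in responses.groupby("class_id"):
            # Flag 1: identical answer strings on the final ten questions,
            # shared wrong answers included -- unlikely to arise by chance.
            tail = grp[[f"q{i}" for i in range(41, 51)]].astype(str)
            tail_strings = tail.apply(lambda r: "".join(r), axis=1)
            dup_share = tail_strings.duplicated(keep=False).mean()

            # Flag 2: pupils scoring higher on the hardest block than the
            # easiest one, implausible given the rising-difficulty design.
            s = scores.loc[grp.index]
            easy = s[[f"q{i}" for i in range(1, 11)]].sum(axis=1)
            hard = s[[f"q{i}" for i in range(41, 51)]].sum(axis=1)
            inverted_share = (hard > easy).mean()

            rows.append({"class_id": class_id,
                         "dup_share": dup_share,
                         "inverted_share": inverted_share,
                         # Composite rating; these cutoffs are arbitrary.
                         "suspect": dup_share > 0.5 or inverted_share > 0.3})
        return pd.DataFrame(rows)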
Improving Pedagogy and Classroom Learning
The greatest value of item-level data lies in its diagnostic power. Aggregate scores show overall performance but hide which skills pupils know or lack. Item-level analysis makes these gaps visible.
In Rwanda, pupils scored above 90% on simple addition without carrying but below 20% on probability and ratios. Such results point clearly to where instruction needs reinforcement. Disaggregation adds further insight. Pupils outside Kigali struggled more with number skills, while those in Kigali were weaker in rounding and place value. This evidence supports targeted interventions instead of one-size-fits-all approaches.
At the classroom level, item-level data avoids misleading averages. Three Primary 4 classes all scored 80% overall, yet one excelled in measurement, another in probability, and the third in estimation. Without item-level data, teachers would miss these differences; with it, they can tailor support.
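A sketch of the underlying computation, assuming a long-format table with one row per pupil-item pair and a separate item-to-domain tag map (all names here are hypothetical), could look like this:

    import pandas as pd

    def domain_mastery(long: pd.DataFrame, tags: pd.DataFrame) -> pd.DataFrame:
        # 'long' has columns class_id, item_id, correct (0/1);
        # 'tags' maps each item_id to a content domain label.
        merged = long.merge(tags, on="item_id")
        # Percent correct per class and domain. The same pivot at district
        # or national level surfaces the gaps an overall average hides.
        return (merged.groupby(["class_id", "domain"])["correct"]
                      .mean()
                      .unstack("domain")
                      .mul(100)
                      .round(1))

Three classes with the same overall score would then show visibly different rows, pointing each teacher toward a different remediation focus.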
Distractor analysis deepens this diagnosis. On one ratio item, half of the pupils chose the same wrong option, revealing a shared misunderstanding of cross-multiplication. On another, most pupils confused a triangle’s height with its sides. Such patterns give teachers concrete guidance on how to adjust instruction.
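Distractor analysis reduces to tabulating the share of pupils choosing each option per item; a sketch under the same assumed long-format layout, with a hypothetical answer key, follows:

    import pandas as pd

    def distractor_table(long: pd.DataFrame, key: dict) -> pd.DataFrame:
        # 'long' has columns item_id and chosen; 'key' maps item_id to the
        # correct option. Shares are computed within each item.
        counts = long.groupby(["item_id", "chosen"]).size()
        shares = counts / counts.groupby(level="item_id").transform("sum")
        out = shares.rename("share").reset_index()
        out["is_key"] = [key[i] == c for i, c in zip(out["item_id"], out["chosen"])]
        # A wrong option drawing a large share of pupils marks a shared
        # misconception, such as the cross-multiplication error above.
        return out.sort_values(["item_id", "share"], ascending=[True, False])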
Strengthening Assessment Quality
Beyond catching simple design errors, item-level data enables empirical validation of the theoretical properties embedded in assessment design. Comparing intended difficulty ratings with actual performance revealed discrepancies: ratio items intended to be of medium difficulty proved among the hardest on the test. Such evidence informs test redesign, ensuring assessments measure what they intend to measure.
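One way to operationalise this comparison is to audit each item’s empirical facility (the share of pupils answering correctly) against its intended difficulty band; the band labels and cutoffs below are assumptions for illustration:

    import pandas as pd

    def difficulty_audit(scores: pd.DataFrame, intended: pd.Series) -> pd.DataFrame:
        # 'scores' is pupils x items (0/1); 'intended' labels each item
        # 'easy', 'medium' or 'hard' per the test blueprint.
        facility = scores.mean()
        bands = {"easy": (0.7, 1.0), "medium": (0.4, 0.7), "hard": (0.0, 0.4)}
        audit = pd.DataFrame({"facility": facility, "intended": intended})
        audit["in_band"] = [bands[b][0] <= f <= bands[b][1]
                            for f, b in zip(audit["facility"], audit["intended"])]
        # Items out of band -- e.g. a 'medium' ratio item with facility
        # under 0.2 -- are candidates for rewriting or re-rating.
        return audit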
With large datasets, advanced models such as Item Response Theory (IRT) become feasible, allowing analysis of item discrimination, guessing, and overall reliability. This raises the quality of national assessments and, in turn, the policies built on them.
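Full IRT estimation would normally use dedicated psychometric software; as a self-contained illustration, the classical statistics below (item facility, corrected point-biserial discrimination, and KR-20 reliability) approximate the same quantities in simpler form. The matrix layout is an assumption.

    import numpy as np

    def item_stats(X: np.ndarray):
        # X is an n_pupils x n_items matrix of 0/1 scores.
        n, k = X.shape
        total = X.sum(axis=1)
        facility = X.mean(axis=0)
        # Corrected point-biserial: each item against the total score
        # excluding that item, a classical proxy for IRT discrimination.
        disc = np.array([np.corrcoef(X[:, j], total - X[:, j])[0, 1]
                         for j in range(k)])
        # KR-20 reliability for dichotomous items.
        kr20 = (k / (k - 1)) * (1 - (facility * (1 - facility)).sum()
                                / total.var(ddof=1))
        return facility, disc, kr20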

Contribution
The study’s contribution is twofold. First, it highlights the value of item-level data for strengthening integrity checks, revealing learning gaps, and improving assessment design. Second, it demonstrates that technology makes this scalable: what was once prohibitively costly is now achievable for entire systems.
As education systems aim to become more data-driven, the challenge is not simply to measure learning, but to measure it meaningfully. Item-level data offers a richer, more actionable picture than aggregate scores ever could. By making this detail available at scale, Let’s Mark bridges the gap between national assessments and classroom practice. It enables governments to make evidence-based decisions, teachers to address specific pupil needs, and policymakers to design fairer, more reliable assessments.
