The measurement of foundational learning has emerged as a global priority, with Sustainable Development Goal 4.1.1a tracking minimum proficiency in reading and mathematics at the end of lower primary. However, the downgrading of the indicator in late 2023, due to concerns about validity and comparability, brought renewed scrutiny to how foundational learning is assessed in low-resource contexts. This critical review evaluates the current evidence on the measurement of decoding in the early grades, with a focus on contextual relevance for the African continent. It explores three interrelated questions: (1) whether fluency or accuracy is a better indicator of decoding proficiency; (2) how assessment protocols (e.g., timing and stopping rules) affect validity; and (3) whether receptive tasks can serve as effective substitutes for, or complements to, productive decoding measures. Beyond psychometric validity, the review situates these questions in relation to the language of assessment and to the affordability, feasibility, and sustainability of different approaches in low-resource contexts.
Fluency versus accuracy
The balance of empirical evidence suggests that fluency is a more discriminating and policy-relevant indicator than accuracy in the early grades, especially in transparent orthographies where accuracy rapidly approaches ceiling levels. Meta-analytic work confirms that fluency correlates more strongly with comprehension than accuracy does, particularly in the lower grades. Comparative evidence from large-scale datasets (e.g., PASEC, MICS) similarly shows that accuracy-based measures can obscure meaningful variation, while fluency distributions remain wide and informative. Taken together, this evidence suggests that fluency should remain central in early grade assessments, with accuracy reported as complementary information rather than a substitute.
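The distinction between the two indicators can be made concrete with a minimal sketch. The function names and the two learner records below are hypothetical and purely illustrative; they are not drawn from any dataset discussed in the review. The sketch assumes accuracy is scored as the proportion of attempted words read correctly and fluency as correct words per minute.

```python
def accuracy(correct: int, attempted: int) -> float:
    """Proportion of attempted words read correctly (0.0 if nothing attempted)."""
    return correct / attempted if attempted else 0.0


def fluency_cwpm(correct: int, seconds_used: float) -> float:
    """Correct words per minute, scaled from the time actually used."""
    if seconds_used <= 0:
        raise ValueError("seconds_used must be positive")
    return correct * 60.0 / seconds_used


# Hypothetical learners: (words correct, words attempted, seconds used).
learners = {
    "A": (19, 20, 60.0),
    "B": (57, 60, 60.0),
}

for name, (correct, attempted, seconds) in learners.items():
    print(name, round(accuracy(correct, attempted), 2),
          round(fluency_cwpm(correct, seconds), 1))
```

In this constructed case both learners have an accuracy of 0.95, yet their fluency differs threefold (19 versus 57 correct words per minute), which is the kind of variation an accuracy-only report would hide.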
Protocols: stopping rules and timing
Protocol choices shape both the validity and feasibility of assessment. Evidence on stopping rules indicates that terminating subtasks after a small number of consecutive failures results in minimal loss of information while reducing administration time, though more work is needed on subgroup effects and conditional probabilities. With respect to timing, studies consistently show that restricting text reading to one minute underestimates comprehension, since many learners never reach the items they are later tested on. Allowing more time (e.g., three minutes) substantially improves the validity of comprehension scores without materially altering fluency measures. These findings highlight the importance of aligning timing and stopping rules with the constructs being measured, particularly when comprehension is included.
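Both protocol effects can be sketched in a few lines. Everything below is an illustrative assumption rather than a protocol taken from the studies reviewed: the threshold of four consecutive failures, the response pattern, the reading pace, and the word positions at which comprehension questions become answerable are all hypothetical.

```python
from typing import List, Tuple


def administer_with_stopping_rule(
    responses: List[bool], max_consecutive_failures: int
) -> Tuple[int, int]:
    """Return (items administered, items correct) when the subtask is
    terminated after a run of consecutive incorrect responses."""
    if max_consecutive_failures < 1:
        raise ValueError("max_consecutive_failures must be at least 1")
    administered = correct = failure_run = 0
    for is_correct in responses:
        administered += 1
        if is_correct:
            correct += 1
            failure_run = 0
        else:
            failure_run += 1
            if failure_run >= max_consecutive_failures:
                break
    return administered, correct


def questions_reached(words_read: int, question_word_positions: List[int]) -> int:
    """Count comprehension questions whose supporting text the learner reached."""
    return sum(1 for position in question_word_positions if position <= words_read)


# Hypothetical item responses for one learner on one subtask.
responses = [True, False, False, False, False, True, False]
print(administer_with_stopping_rule(responses, 4))   # (5, 1): stopped early
print(administer_with_stopping_rule(responses, 10))  # (7, 2): all items given

# Hypothetical 60-word passage; each question needs the text up to that word.
positions = [10, 22, 35, 48, 60]
words_per_minute = 20
print(questions_reached(words_per_minute * 1, positions))  # 1 question reached
print(questions_reached(words_per_minute * 3, positions))  # 5 questions reached
```

In this example the stopping rule saves two item administrations at the cost of one correct response, while the one-minute limit leaves a slow reader able to attempt only one of five comprehension questions.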
Receptive tasks
Receptive tasks—including letter–sound recognition, sentence verification, maze, and slasher tasks—offer promise for group-based or digital administration, but they capture related rather than identical constructs. They often show stronger correlations with fluency than with comprehension and can be susceptible to guessing, floor effects, and ceiling effects depending on design. While such tasks may provide cost savings or scalability advantages, they should be regarded as complements rather than replacements for oral fluency measures in early grades. Continued validation work is needed to clarify the contexts in which receptive tasks add most value.
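The susceptibility to guessing can be illustrated with the classical correction-for-guessing formula (right answers minus wrong answers divided by the number of options less one). This is one common adjustment, offered here only as a sketch; the review does not prescribe it, and the function name and item counts are hypothetical.

```python
def corrected_for_guessing(right: int, wrong: int, n_options: int) -> float:
    """Classical formula score: right - wrong / (n_options - 1).
    Omitted items count as neither right nor wrong."""
    if n_options < 2:
        raise ValueError("n_options must be at least 2")
    return right - wrong / (n_options - 1)


# Hypothetical 20-item sentence verification task (true/false, so 2 options).
print(corrected_for_guessing(14, 6, 2))   # 8.0
# A learner responding at chance would expect about 10 right and 10 wrong.
print(corrected_for_guessing(10, 10, 2))  # 0.0
```

With only two response options, a raw score of 14 out of 20 shrinks to 8 after correction, which shows how much of a receptive score can be attributable to chance when the design offers few alternatives.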
Language of assessment
Language remains a central validity concern in foundational assessments. In many contexts, children are tested in international or national languages rather than their language of instruction or home language, distorting results and undermining equity. Contrasts between performance in local languages and international languages (e.g., Burundi Grade 2 vs. Grade 6) demonstrate how profoundly assessment language shapes outcomes. This issue remains underappreciated in much of the international research literature, with some high-profile studies insufficiently accounting for linguistic variation. Addressing language appropriately is thus not only a technical matter but also a question of fairness in global monitoring.
Cost, feasibility, and reliability
Cost and feasibility concerns often motivate calls for alternatives to one-on-one oral assessments. The main arguments against such assessments are twofold: (a) inter-rater reliability concerns, and (b) higher costs relative to group-administered or self-administered methods. However, available evidence indicates that inter-rater reliability is typically high, even for fluency tasks, when assessors are adequately trained. In fact, studies comparing tasks with and without fluency measurement find similar levels of inter-rater reliability with the same assessors, suggesting that reliability is not the main constraint. Rather, the additional administration time and cost are the relevant trade-offs, for which systematic comparative costing data are lacking.
For group-administered and self-administered approaches, both reliability and validity issues require closer scrutiny. A few validity studies indicate that inter-rater reliability can be high in one-on-one contexts, but comparable evidence for group or digital modes is very limited. Similarly, while group administration may promise cost savings, the lack of transparent costing data makes it difficult to evaluate trade-offs between efficiency and the validity of the information obtained. As with comparisons within a single mode, better evidence on costs, training requirements, and inter-rater reliability across different modes of administration is essential.
Psychometric modelling of continuous data
The widespread use of item response theory (IRT) in large-scale assessments has raised concerns about whether methodological convenience is driving design choices. Continuous measures such as fluency do not always fit well within conventional IRT frameworks, leading some to discretize or band fluency data, which risks discarding meaningful variation. More complex psychometric models, including explanatory IRT and response-time models, are available and could be applied more systematically to preserve information and better capture the underlying constructs. Future work should address how these modelling choices influence benchmarks, comparability, and the policy signals drawn from assessment data.
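The information lost through banding can be seen in a small sketch. The cut points and scores below are hypothetical and are not benchmarks proposed by the review; the sketch simply assumes fluency is discretized at fixed thresholds.

```python
from bisect import bisect_right
from typing import Sequence


def band(score: float, cut_points: Sequence[float]) -> int:
    """Map a continuous fluency score to an ordinal band index.
    cut_points must be sorted ascending; a score equal to a cut point
    falls in the higher band."""
    return bisect_right(cut_points, score)


# Hypothetical cut points in correct words per minute.
cut_points = [20, 40]

# Hypothetical learner scores.
scores = [21, 39, 41]
print([band(score, cut_points) for score in scores])  # [1, 1, 2]
```

Here learners reading 21 and 39 correct words per minute receive the same band despite an 18-word gap, while learners at 39 and 41 are separated despite a 2-word gap. Models that treat fluency as continuous avoid this distortion.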
Implications and priorities
Overall, this review underscores that fluency remains an essential indicator of early reading proficiency in low- and middle-income countries (LMICs), but its utility depends on careful attention to protocols, language, and modelling choices. Receptive tasks and new modes of administration offer promising complements but require further validation before they can substitute for fluency measures in foundational grades. Inter-rater reliability does not appear to be the central challenge in one-on-one fluency assessment; rather, the key issues are cost and feasibility, both of which require systematic, transparent study across different modalities. Finally, psychometric approaches must adapt to continuous fluency data rather than reshaping the data to fit the tool.