Achievement test scores are widely used in economics and social science. Their use is often motivated, either implicitly or explicitly, by the idea that they are proxies for human capital, an idea which is bolstered by the strong correlations that exist between test scores and economically important outcomes such as school completion and labor market earnings.
The basic input data for any achievement test score are (1) the student's responses to the individual questions (“items”) and (2) the actual content of these items. A test scale can fundamentally be thought of as an algorithm which aggregates the full vector of a student's item responses into a scalar -- the test score. For example, a simple “percent correct” scoring rule aggregates by weighting each item equally -- two students with the same number of correct responses will receive the same score regardless of which specific items they answer correctly. Modern psychometric methods such as item response theory (IRT) aggregate in a more theoretically motivated way, but they still represent particular choices about how much to emphasize different items.
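As a simple illustration of how a scale aggregates item responses, the sketch below contrasts an equal-weight percent-correct score with a generic weighted score. The response matrix and weights are hypothetical, not the scoring rule of any particular test.

```python
import numpy as np

# Hypothetical response matrix: rows = students, columns = items (1 = correct).
responses = np.array([
    [1, 0, 1, 1],
    [1, 1, 0, 1],
])

# Percent-correct scoring: every item gets the same weight.
percent_correct = responses.mean(axis=1)

# A generic weighted scale: same responses, but items contribute unequally.
# These weights are illustrative only (an IRT or anchored scale would estimate them).
weights = np.array([0.1, 0.4, 0.3, 0.2])
weighted_score = responses @ weights

print(percent_correct)  # both students: 0.75 -- identical under equal weighting
print(weighted_score)   # 0.6 vs 0.7 -- different once items are weighted unequally
```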
Traditional item aggregation methods do not consider the economic usefulness of the skills each item assesses when constructing the test score. As a result, the skills emphasized by these tests may not be the skills most valuable for the economic outcomes of ultimate interest to social scientists. In this paper we construct “item-anchored” scales that choose item weights to minimize the prediction error (mean squared error) of the test score for a particular economic outcome (wages, school completion, etc.), using administrative data from Texas covering about 12 million students.
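A minimal sketch of the idea behind an item-anchored scale, under the assumption that anchoring amounts to choosing item weights by regressing the outcome on the vector of item responses; the paper's actual estimator may differ, and all data here are simulated.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_students, n_items = 5000, 40

# Simulated item responses and an outcome (e.g., log wages) that loads
# more heavily on some items than on others.
responses = rng.binomial(1, 0.6, size=(n_students, n_items))
true_weights = rng.exponential(1.0, size=n_items) * rng.binomial(1, 0.3, size=n_items)
outcome = responses @ true_weights + rng.normal(0, 1.0, size=n_students)

# "Item-anchored" weights: chosen to minimize mean squared prediction error
# of the outcome given the item responses (ridge regularization for stability).
anchor = Ridge(alpha=1.0).fit(responses, outcome)
item_weights = anchor.coef_

# The anchored score is the weighted sum of item responses,
# in contrast to the equal-weight percent-correct score.
anchored_score = responses @ item_weights
percent_correct = responses.mean(axis=1)
```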
We find that these item-anchored scores have notably different implications for individual and group-level achievement differences, including racial achievement gaps, than standard psychometric measures. The main driver of the difference between our measurements of achievement gaps and traditional ones is that minority students answer the questions most predictive of long-term wages correctly at a lower rate than their white counterparts - a fact that conventional achievement gaps do not take into account.
To open the “black box” of which question characteristics are associated with higher weights in our method, we develop a framework to explore the relationship between our item-level weights and traditional psychometric characteristics, as well as potentially new skills captured by the text of the questions themselves. For the latter, we digitize the test booklets and apply natural language processing (NLP) and other computational techniques to extract the underlying skills and item characteristics driving these achievement gaps.
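A hedged sketch of the kind of item-level analysis this step suggests, assuming item weights are related to psychometric characteristics (difficulty, discrimination) and to simple text features via a regression; the feature names, the tiny example data, and the TF-IDF text representation are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression

# Hypothetical item-level data: anchored weight from the previous step,
# standard psychometric characteristics, and the digitized question text.
items = pd.DataFrame({
    "weight": [0.15, 0.02, 0.40, 0.05, 0.22, 0.08],
    "difficulty": [0.3, 0.7, 0.5, 0.6, 0.4, 0.8],
    "discrimination": [1.1, 0.8, 1.5, 0.9, 1.2, 0.7],
    "text": [
        "Solve the equation for x.",
        "Identify the noun in the sentence.",
        "Interpret the data shown in the table.",
        "Choose the correctly spelled word.",
        "Estimate the area of the rectangle.",
        "Match the word to its definition.",
    ],
})

# Relate item weights to traditional psychometric characteristics.
psychometric = items[["difficulty", "discrimination"]].to_numpy()
fit = LinearRegression().fit(psychometric, items["weight"])
print(dict(zip(["difficulty", "discrimination"], fit.coef_)))

# Illustrative NLP step: represent the digitized question text as features
# that can be related to the item weights in the same way.
text_features = TfidfVectorizer(stop_words="english").fit_transform(items["text"])
```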