Individual Submission Summary
What Do Language Models Know About the Past? LLM Representation of U.S. Labour Force Structure, 1850–2024

Sat, August 8, 10:00 to 11:30am, TBA

Abstract

Large language models (LLMs) are rapidly entering the toolkits of researchers across the social sciences and humanities. If these models can accurately represent aspects of the past, they could enable researchers to simulate historical actors or approximate the kinds of experiments and surveys possible in contemporary sociology but not currently possible in historical research. But what do LLMs actually "know" about history? LLMs are trained on corpora reflecting particular patterns of documentation, preservation, and digitization of information. They currently cannot distinguish between fact and fiction, or between primary sources and retrospective accounts. If LLMs encode such asymmetries, researchers risk mistaking their biases for the historical record itself. To realize the promise of LLMs for historical research, we need to better understand whether, and in what ways, they can accurately represent the past. This paper takes an empirical approach to a question often treated speculatively: can we trust historical information generated by LLMs? We focus on a domain where reliable benchmarking exists: the occupational composition of the U.S. labour force from 1850 to 2024, comparing outputs from six LLMs against census data as ground truth. We find that LLMs appear to encode meaningful historical knowledge, but this knowledge is systematically uneven. Models perform well on the distant past (1850–1880) and the recent past (1950–2024), but seem to struggle with the early 20th century (1900–1940). This pattern holds across all models regardless of size or architecture, suggesting gaps in training corpora rather than model-specific limitations. Our findings suggest that LLMs may prove useful for historical research, but require careful validation. Periods falling between the public domain and the digital age, where copyright restricts access to primary sources and internet documentation is sparse, may be particularly prone to gaps and distortions that would not be apparent without ground truth comparison.

Authors