Same, Same? Ensuring Comparative Equivalence in the Semantic Analysis of Heterogeneous, Multilingual Corpora

Fri, May 26, 14:00 to 15:15, Hilton San Diego Bayfront, Floor: 3, Aqua 309


Many computational approaches are optimized for processing large quantities of similar textual material but suffer from severe validity problems when applied to heterogeneous or multilingual corpora. This paper explicates and discusses several tacit assumptions about the homogeneity of the analyzed material that are hard-coded into many existing tools, and reviews the biases and artifacts that can result from applying these tools to heterogeneous discourse. Specifically, the paper argues that operational equivalence must be established both at the level of semantic entities and at the level of meaningful associations between them. Introducing Jamcode, a dictionary-based tool developed for the comparative analysis of statements and frames expressed in political debates, strategic communication, news, and social media in eight languages, the paper presents suitable strategies for ensuring equivalence. Presenting results from a sequence of validation studies, the paper discusses specific trade-offs and remaining limitations of the tool and charts the way for further development.