Individual Submission Summary

Corpustools: An R Package for Text Analysis Beyond Bags of Words

Fri, May 26, 14:00 to 15:15, Hilton San Diego Bayfront, Floor: 3, Aqua 309


Numerous good packages, such as quanteda, tm, and topicmodels, exist for representing and analysing text as document-term matrices. However, the bag-of-words assumption makes it difficult to use the word-order and syntactic features required for, e.g., semantic network analysis and context-aware dictionaries. Moreover, dropping the word order makes it difficult to relate findings back to the original documents.

We present corpustools, an R package that overcomes these limitations by using indexed lists of tokens (words) rather than a document-term matrix. This representation allows for better search and network analysis, as well as for inspecting LDA results. Integration with tm and quanteda is provided.
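
To illustrate the difference between the two representations, the following is a minimal, language-neutral sketch in Python. It is not the corpustools API: the toy documents, the `tokens` layout, and the `proximity_hits` helper are all assumptions made for illustration. It shows that two sentences with opposite meanings share the same bag of words, while a positional token list can still tell them apart and point back to the exact location in the source document.

```python
from collections import Counter

# Two toy documents (hypothetical) with the same words in a different order.
docs = {
    "d1": "the senator attacked the governor",
    "d2": "the governor attacked the senator",
}

# Bag of words: one count vector per document; word order is discarded.
bags = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

# Indexed token list: one (doc_id, position, token) row per token,
# so word order and the link to the original text are preserved.
tokens = [
    (doc_id, pos, tok)
    for doc_id, text in docs.items()
    for pos, tok in enumerate(text.split())
]


def proximity_hits(tokens, first, second, window=2):
    """Find each `first` token that is followed by `second` within `window` tokens.

    Returns a sorted list of (doc_id, position) pairs for the `first` token.
    """
    # Index positions by (document, token) for quick lookup.
    positions = {}
    for doc_id, pos, tok in tokens:
        positions.setdefault((doc_id, tok), []).append(pos)

    hits = []
    for (doc_id, tok), first_positions in positions.items():
        if tok != first:
            continue
        second_positions = positions.get((doc_id, second), [])
        for p in first_positions:
            if any(0 < q - p <= window for q in second_positions):
                hits.append((doc_id, p))
    return sorted(hits)


# The bags are identical, so a document-term matrix cannot distinguish the documents.
print(bags["d1"] == bags["d2"])
# The token list can: only d1 has "senator" shortly before "attacked".
print(proximity_hits(tokens, "senator", "attacked"))
```

Because every hit carries a document identifier and a token position, a result of this kind can be traced back to the passage that produced it, which a count matrix alone does not allow.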