Quantitative text analysis using R (Kenneth Benoit)
This workshop introduces quantitative text analysis and natural language processing in R, focusing on the quanteda package developed by Kenneth Benoit et al. and also covering other major text analysis tools in the R ecosystem (e.g. stringi). Topics include how to perform common text analysis and natural language processing tasks using R. Contrary to a belief popular among some data scientists, R, when used properly, is a fast and powerful tool for managing even very large text analysis tasks. I will demonstrate how to format and input source texts, how to structure their metadata, and how to prepare them for analysis. This includes common tasks such as tokenization (including constructing ngrams and “skip-grams”), removing stop words, stemming words, and other forms of feature selection. I will also show how to tag parts of speech and parse structural dependencies in texts. For statistical analysis, I will show how R can be used to get summary statistics from text, search for and analyze keywords and phrases, analyze text for lexical diversity and readability, detect collocations, apply dictionaries, and measure term and document associations using distance measures. I will also cover how to pass the structured objects from quanteda into other text analysis packages for topic modelling, latent semantic analysis, regression models, and other forms of machine learning. See https://quanteda.io for more details.
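As a rough illustration of the workflow described above, the following is a minimal sketch rather than the workshop's actual material. It assumes a recent quanteda release (version 3 or later) is installed; the two example texts, the `year` metadata variable, and the `fiscal` dictionary are made up for the example.

```r
library(quanteda)

# Toy input: the texts and the 'year' document variable are hypothetical.
txts <- c(d1 = "Taxes were cut and spending was increased.",
          d2 = "The government increased taxes on fuel.")
corp <- corpus(txts, docvars = data.frame(year = c(2010, 2011)))

# Tokenize, drop punctuation and English stop words.
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))

# Build bigrams and skip-grams from the cleaned tokens.
bigrams   <- tokens_ngrams(toks, n = 2)
skipgrams <- tokens_skipgrams(toks, n = 2, skip = 0:1)

# Document-feature matrix (features are lowercased by default).
dfmat <- dfm(toks)

# Apply a small, hypothetical dictionary using glob patterns.
dict <- dictionary(list(fiscal = c("tax*", "spending")))
dfmat_dict <- dfm_lookup(dfmat, dict)

# Keywords in context for a pattern.
kw <- kwic(toks, pattern = "tax*")

# Convert the matrix to a plain data frame for use in other packages.
df <- convert(dfmat, to = "data.frame")
```

In recent releases the statistical functions (lexical diversity, readability, collocations, distances) live in the companion package quanteda.textstats, so they are left out of this sketch.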