Tuesday, April 10-Friday, April 13
Monday, April 16-Friday, April 20
Two lectures/workshops each morning (9:30-10:50; 11:00-12:40) -- mandatory
Lab session in the afternoon (13:30-15:10) -- optional
Classes will be held in Nádor 13: Rooms 302 and 303*
*except Thursday, April 18: Rooms 301 and 302
The goal of this course is to provide training in text mining simultaneously on two distinct levels. The material from both levels should be accessible and useful to social scientists as well as humanists, as strategies such as text mining, content analysis, sentiment analysis and entity extraction are becoming fundamental to research on large and diverse digital corpora.
The basic level is offered with no prerequisites, and is designed for students from the humanities or from qualitative social science backgrounds who are interested in learning the fundamentals of text analysis using computational methods. We will begin with typical source materials for qualitative research (archival records, interviews, print and online media, primary and secondary literature) and demonstrate how to collect and curate a full-text, machine-readable corpus, extract and standardize metadata, and then analyze and visualize the text. The last section of the course will look closely at how such techniques can be integrated with non-computational methods to create a balanced and nuanced analysis, one that is informed by ‘distant reading’ but does not sacrifice the complexities offered by close reading.
The more advanced level of the boot camp is designed for students who have some exposure to statistically informed methods and to query or programming languages (SQL, Python, R) and would like to apply these methods to the computational and statistical analysis of texts. Prerequisites will include Introduction to Statistics (or equivalent) and Introduction to R (or equivalent). We will also take students through the process of corpus creation, but at a much faster pace, as we will assume basic familiarity with scraping, OCR, and dataset curation. The more advanced level will offer more specific training in stylometry, entity extraction, and other features of natural language processing.
- corpus selection and cleaning
- metadata collection
- research question design
- basic programming for textual analysis (R software environment and relevant packages: tm, stylo, ggplot2, topicmodels, klaR, etc.)
- applied statistical evaluation of results
- topic modeling
- stylometry (authorship attribution and forensic authorship analysis)
- frequency analysis and genre
- classification, variable selection and discriminant function analysis
- natural language processing, part of speech tagging, named entity recognition