COMPTEXT 2020,
Innsbruck
The 3rd Symposium on the Quantitative Analysis of Textual Data
15-16 May, 2020
University of Innsbruck, Innsbruck, Austria
 

Tutorial A – 14 May, 2020, 9:30 am to 1 pm

Quantitative Text Analysis for Absolute Beginners: From Texts to a Document-Feature Matrix (Stefan Müller)

About the tutorial:

The workshop provides a hands-on introduction to quantitative text analysis using quanteda (https://quanteda.io) and related R packages. In the first part of the workshop, participants learn how to import texts in various formats into R. We then describe the functionality of a text corpus, explain the difference between types and tokens, and reshape texts from the document level to sentences or paragraphs. Afterwards, we turn to tokenization and discuss ways of selecting and removing tokens as well as detecting and compounding multi-word expressions. Finally, we construct a document-feature matrix for the quantitative analysis of textual data.
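The steps described above can be sketched in a few lines of quanteda code; the example texts and the compounded expression are illustrative:

```r
# From raw texts to a document-feature matrix with quanteda
library(quanteda)

txts <- c(doc1 = "Text analysis is fun. Text analysis is also useful.",
          doc2 = "Quantitative text analysis builds a document-feature matrix.")

corp <- corpus(txts)                                    # build a text corpus
corp_sent <- corpus_reshape(corp, to = "sentences")     # documents -> sentences

toks <- tokens(corp, remove_punct = TRUE)               # tokenize
toks <- tokens_remove(toks, stopwords("en"))            # drop English stopwords
toks <- tokens_compound(toks, phrase("text analysis"))  # compound a multi-word expression

dfmat <- dfm(toks)      # document-feature matrix
topfeatures(dfmat)      # most frequent features
```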

The applied elements of the workshop will make use of the R programming language. Prior knowledge of text analysis is not required. Participants without any prior experience with R are encouraged to read Chapter 1 of the quanteda tutorials (https://tutorials.quanteda.io). We strongly recommend attending both sessions of the workshop “Quantitative Text Analysis for Absolute Beginners”.

Stefan Müller is an Assistant Professor and Ad Astra Fellow in the School of Politics and International Relations at University College Dublin. He is a founding member of the Connected_Politics Lab at University College Dublin, core contributor to the quanteda R package, and Training Advisor of the Quanteda Initiative CIC. Stefan’s research focuses on the interactions between political parties, voters, and the media. He develops and validates user-friendly tools for the efficient and reliable combination of human coding and machine learning.

Tutorial B – 14 May, 2020
2 pm to 5:30 pm

Quantitative Text Analysis for Absolute Beginners: Textual Statistics, Scaling, and Classification (Kenneth Benoit)

About the tutorial:

The second part of the workshop “Quantitative Text Analysis for Absolute Beginners” provides an overview of textual statistics, such as readability, text similarity, keyness, and lexical diversity. Moreover, participants will get to know and apply textual scaling models, such as Wordscores and Wordfish. The last part of the tutorial introduces supervised machine learning, which leverages human coding to classify large amounts of unlabelled texts.
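As a preview, the statistics and scaling models mentioned above can be run on quanteda's built-in corpus of US presidential inaugural speeches. This sketch assumes the companion packages quanteda.textstats and quanteda.textmodels (in older quanteda versions the textstat_* functions live in quanteda itself):

```r
library(quanteda)
library(quanteda.textstats)   # textstat_* statistics
library(quanteda.textmodels)  # scaling models

# Recent inaugural speeches, reduced to a trimmed document-feature matrix
corp  <- corpus_subset(data_corpus_inaugural, Year > 1970)
dfmat <- dfm_trim(dfm(tokens(corp, remove_punct = TRUE)), min_termfreq = 5)

stat_read <- textstat_readability(corp, measure = "Flesch")  # readability
stat_lex  <- textstat_lexdiv(dfmat, measure = "TTR")         # lexical diversity

wf <- textmodel_wordfish(dfmat)  # unsupervised one-dimensional scaling
head(wf$theta)                   # estimated document positions
```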

The applied elements of the workshop will make use of the R programming language. Prior knowledge of text analysis is not required. Participants without any prior experience with R are encouraged to read Chapter 1 of the quanteda tutorials (https://tutorials.quanteda.io). We strongly recommend attending both sessions of the workshop “Quantitative Text Analysis for Absolute Beginners”.

Kenneth Benoit is Professor of Computational Social Science in the Department of Methodology at the London School of Economics and Political Science. His current research focuses on computational, quantitative methods for processing large amounts of textual data, mainly political texts and social media. Kenneth Benoit is the creator of the quanteda R package and Managing Director and Founder of the Quanteda Initiative CIC.

Tutorial C – 14 May, 2020, 9:30 am to 1 pm

Introduction to Semi-supervised Document Classification and Scaling  (Kohei Watanabe)

About the tutorial:

This tutorial introduces participants to semi-supervised techniques for document classification and scaling. Unsupervised and fully supervised techniques have been widely used in quantitative text analysis, but they often produce output with little theoretical relevance or demand more manually created input than researchers can provide. Semi-supervised techniques enable researchers to perform theory-driven analysis at minimal cost using a small set of keywords called “seed words”. The instructor will explain how to pre-process data and select seed words to obtain the best results. The semi-supervised models covered in this tutorial are seeded LDA, Newsmap, and Latent Semantic Scaling (LSS). These models are especially useful for researchers who analyse large and complex textual data in languages or disciplines that have only a few lexical resources (e.g. dictionaries).
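To illustrate the seed-word idea behind Latent Semantic Scaling, the following hand-rolled sketch (not the LSS package itself) embeds words with a truncated SVD, assigns each word a polarity from its similarity to a handful of seed words, and then scores documents. The texts and seed words are toy examples:

```r
# Minimal illustration of the LSS idea: seed words propagate polarity to
# other words through a latent semantic space, then documents are scored.
library(quanteda)

txts <- c("good great excellent service", "bad poor terrible service",
          "great food good staff",        "terrible food poor staff")
m <- as.matrix(dfm(tokens(txts)))

# Truncated SVD: rows of v are word vectors in a k-dimensional latent space
k <- 2
s <- svd(m)
wordvec <- s$v[, 1:k, drop = FALSE]
rownames(wordvec) <- colnames(m)

cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

seeds <- c(good = 1, great = 1, bad = -1, poor = -1)  # seed words with polarity

# Each word's polarity = weighted sum of its similarity to the seed words
polarity <- sapply(rownames(wordvec), function(w)
  sum(sapply(names(seeds), function(sw)
    seeds[sw] * cosine(wordvec[w, ], wordvec[sw, ]))))

# Document score = average polarity of its words
doc_score <- as.numeric(m %*% polarity) / rowSums(m)
doc_score
```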

Software: 

R, quanteda, topicmodels, newsmap, LSS

Prior knowledge:

Participants should have experience with supervised (e.g. Wordscores, Naïve Bayes, SVM) or unsupervised (e.g. Wordfish, topic models) document classification and scaling.

Literature:

Watanabe, K. (2017). Measuring news bias: Russia’s official news agency ITAR-TASS’ coverage of the Ukraine crisis. European Journal of Communication, 32(3), 224–241. https://doi.org/10.1177/0267323117695735

Kohei Watanabe is a senior assistant professor at the Department of Political Science and the Digital Science Center (DiSC) of the University of Innsbruck. Before coming to Austria, Kohei worked at the Waseda Institute for Advanced Study (WIAS) of Waseda University and at the Department of Methodology and the Department of International Relations of the London School of Economics and Political Science (LSE). He is also an affiliated researcher at Waseda University, a member of a project on popular mobilization in Russia at the LSE’s International Relations Department, and a main contributor to a quantitative text analysis package in R.

Tutorial D – 14 May, 2020
2 pm to 5:30 pm 

Word embeddings (Dong Nguyen) 

About the tutorial:

Word embeddings have radically changed the field of NLP and are also increasingly used as first class research objects to study social and linguistic questions. This tutorial will introduce participants to word embeddings and cover the following topics:

– What are word embeddings?
– A few popular approaches to train word embeddings
– Design decisions
– Approaches to analyze and evaluate word embeddings
– Caveats when using word embeddings
– Example applications: analyses of linguistic change and analyses of biases in embeddings
– A high-level introduction to contextual word embeddings
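The first two points above can be made concrete with a tiny base-R example: words are represented as vectors, and cosine similarity measures how related they are. The 3-dimensional vectors below are made up for illustration; real embeddings have hundreds of dimensions and are learned from large corpora:

```r
# Toy "embedding" matrix: one made-up vector per word
emb <- rbind(
  king  = c(0.8, 0.7, 0.1),
  queen = c(0.8, 0.1, 0.7),
  man   = c(0.9, 0.8, 0.0),
  apple = c(0.1, 0.1, 0.9)
)

# Cosine similarity: angle between two word vectors, ignoring their length
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

cosine(emb["king", ], emb["man", ])    # related words -> high similarity
cosine(emb["king", ], emb["apple", ])  # unrelated words -> lower similarity
```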

Software:
Code examples and exercises will be provided in Python and/or R.

Prior Knowledge:
The tutorial will be accessible to participants with a range of backgrounds, although it is recommended that participants have basic knowledge of linear algebra (e.g. vectors and vector operations) and some experience with text analysis.

Literature:
There are no required readings for preparation, but participants with no prior knowledge of word embeddings may find the following helpful: Chapter 6, “Vector Semantics and Embeddings”, of Speech and Language Processing (3rd ed.) by Jurafsky and Martin; “Contextual Word Representations: A Contextual Introduction” by Noah A. Smith, 2019 (https://arxiv.org/abs/1902.06006); and “The Illustrated Word2vec” by Jay Alammar (http://jalammar.github.io/illustrated-word2vec/).

Dong Nguyen is an assistant professor at Utrecht University. Previously, Dong was a research fellow at the Alan Turing Institute and was also affiliated with the University of Edinburgh. Dong completed her Ph.D. at the University of Twente. She received a master’s degree from the Language Technologies Institute at Carnegie Mellon University and a bachelor’s degree in Computer Science from the University of Twente. Dong has interned at Facebook (fall 2011), Microsoft Research (fall 2013), and Google (summer 2014). In fall 2015 Dong visited Georgia Tech.

Tutorial E – 14 May, 2020
2 pm to 5:30 pm 

Collecting and Processing Text Data from Online Sources  (Theresa Gessler)

About the tutorial: 

The increasing availability of large amounts of data is changing research across the social sciences. Over the past years, a variety of data – whether election results, press releases, parliamentary speeches or social media posts – has become available online. Although data has become easier to find, in most cases, it comes in an unstructured format. This makes collecting, cleaning and analysing this data challenging.

In this tutorial, we will practice basic techniques for scraping and processing text data from the web. The goal is to equip you to gather your own data and process it in R for text analysis. In addition to scraping, we discuss techniques for extracting and processing the text data into a format that is suitable for various text analysis methods.

While we will stick to the basics, we will discuss some common challenges as well as which more advanced techniques help with tackling them.
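The core scraping workflow can be sketched with rvest: parse HTML, select nodes with CSS selectors, and extract their text. Here an inline HTML string stands in for a live page, and the class names are made up for the example:

```r
# Parse HTML, select nodes by CSS selector, extract text with rvest
library(rvest)

page <- read_html('
  <html><body>
    <div class="speech"><h2>Speech A</h2><p>First paragraph.</p></div>
    <div class="speech"><h2>Speech B</h2><p>Second paragraph.</p></div>
  </body></html>')

titles <- page %>% html_nodes(".speech h2") %>% html_text()
texts  <- page %>% html_nodes(".speech p")  %>% html_text()

# Combine into a tidy, analysis-ready data frame
data.frame(title = titles, text = texts)
```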

Software: R. Most packages can be installed on the spot, but install rvest, stringr and dplyr in advance for a head start.

Prior knowledge: Familiarity with the R programming language

Theresa Gessler is a postdoc in political science at the University of Zurich, working on conflicts around democracy, immigration, and patterns of party competition. Alongside classical political science methods, Theresa uses text-as-data, web scraping, and various types of digital trace data in her research. One of her papers recently won the Best Presentation Award at the European Symposium on Societal Challenges in Computational Social Science, and another has recently been published as an accepted article in the European Journal of Political Research.