About the tutorial

How can I import texts in various formats into R? What is a text corpus? What is the difference between tokens and types? How do I construct a document-feature matrix to conduct quantitative analyses of textual data? What are common techniques to classify or scale large amounts of texts? The tutorial on “Quantitative Text Analysis for Absolute Beginners” will provide answers to these questions. 

The workshop provides a hands-on introduction to quantitative text analysis using quanteda (https://quanteda.io) and related R packages. The following topics will be covered in the workshop:

– From raw texts to a corpus: how to import textual data and prepare it for analysis.

– Tokenization, feature selection, and creating a document-feature matrix.

– A primer on classification, scaling, and topic models. 

The applied elements of the workshop will make use of the R programming language. Prior knowledge of text analysis is not required. Participants without any prior experience with R are encouraged to read Chapter 1 of the quanteda tutorials (https://tutorials.quanteda.io).
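The corpus-to-matrix step at the heart of the workshop can be sketched in a few lines. quanteda itself is an R package; purely as a language-neutral illustration of what a document-feature matrix is (and of the tokens/types distinction), the same step looks like this in Python with scikit-learn, using two invented example texts:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Immigration policy dominated the debate.",
    "The debate focused on tax policy.",
]

vectorizer = CountVectorizer()        # tokenize and count word features
dfm = vectorizer.fit_transform(docs)  # documents x features sparse matrix

n_types = len(vectorizer.vocabulary_)  # unique word forms (types)
n_tokens = dfm.sum()                   # total word occurrences (tokens)
print(dfm.shape, n_types, n_tokens)    # 2 documents, 8 types, 11 tokens
```

Each row of the matrix is one document and each column one feature, which is exactly the structure that classification, scaling, and topic models operate on.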

About the instructor

Stefan Müller is a Postdoctoral Researcher in the Department of Political Science at the University of Zurich. He received his PhD in Political Science from Trinity College Dublin, and is a core contributor to the quanteda R package. His research focuses on the interactions between parties, voters, and the media. Additionally, he is involved in projects on coalition prediction, public opinion, legislative behavior, the incumbency advantage, and political compromise. He develops and validates user-friendly tools for the efficient and reliable combination of human coding and machine learning.

B. Machine Learning using Text Data Room 202

About the tutorial

This workshop introduces social scientists to machine learning using text data. The objective is to take the content of text — e.g. politicians’ speeches, newspaper articles, or judicial decisions — and form predictions of an associated outcome — e.g. topics, partisanship, or citations.
This workshop introduces social scientists to machine learning using text data. The objective is to take the content of text — e.g. politicians’ speeches, newspaper articles, or judicial decisions — and form predictions of an associated outcome — e.g. topics, partisanship, or citations.

The course will take participants through all steps of a machine learning project:

1) Extracting predictive, interpretable, and tractable features from text;

2) Selecting the right model for regression or classification;

3) Tuning model hyperparameters;

4) Evaluating model performance;

5) Explaining model predictions;

6) Using the predictions for empirical analysis.
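Steps 1–4 above can be compressed into a minimal sketch (illustrative only, not the course's own code): TF-IDF features extracted from text, a logistic regression classifier, hyperparameter tuning by cross-validation, and evaluation on held-out data. The toy texts and labels are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

texts = ["cut taxes now", "raise the minimum wage", "lower corporate taxes",
         "expand public healthcare", "tax relief for families",
         "invest in social housing"]
labels = [0, 1, 0, 1, 0, 1]   # toy outcome, e.g. 0 = right-leaning, 1 = left-leaning

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=2, random_state=1, stratify=labels)

# Step 1 (features) and step 2 (model choice) combined in one pipeline.
pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LogisticRegression(max_iter=1000))])

# Step 3: tune the regularisation strength C by cross-validation.
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=2)
search.fit(X_train, y_train)

# Step 4: evaluate on held-out data.
print("held-out accuracy:", search.score(X_test, y_test))
```

Steps 5 and 6 would then inspect, e.g., the fitted coefficients per feature and feed the predictions into a downstream empirical analysis.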

Throughout, there will be repeated emphasis on research questions and research design. Code examples will be provided in Python and in R.

About the instructor

Elliott Ash is Assistant Professor at the ETH Zurich Department of Social Sciences, where he chairs the Law, Economics, and Data Science Group. Professor Ash’s research undertakes empirical analysis of law and political economy, with methods drawn from applied microeconometrics, natural language processing, and machine learning. He was previously Assistant Professor of Economics at the University of Warwick, and before that a Postdoctoral Research Associate at Princeton University. He received a PhD in economics and a JD from Columbia University, a BA in economics, government, and philosophy from the University of Texas at Austin, and an LLM in international criminal law from the University of Amsterdam.

A-2, F & C (13:00-15:00)

F. Classification Challenge Room 202

About the tutorial

Political scientists have long been interested in how parliamentarians distribute their attention across different policy issues, in order to measure the extent to which they follow constituents’ interests and/or how attention changes over time. The Policy Agendas Project (John, 2006; Baumgartner, Green-Pedersen & Jones, 2006) has been systematically tracking changes in policy activity within particular areas of policymaking over long periods of time. The project has developed a codebook with 21 major topics and more than 200 subtopics, used for coding political texts in over 18 countries. These coded data have enabled comparative policy agenda studies for over two decades (Baumgartner, Breunig & Grossman, 2019).

Multiple country-level datasets are available from different sources and for multiple periods of time. However, manual coding is extremely time-consuming and costly, and new solutions are needed to scale up this type of research. In this challenge, you will tackle a supervised machine learning problem by developing classification models for texts in different languages. In particular, we expect you to apply the tools of machine learning to predict the topics of parliamentary interventions. The lists below show the relevant skills for each level:

Beginner

  • Text pre-processing – quanteda
  • Bag-of-words models
  • Multiclass classification
  • R basics

Advanced

  • Text pre-processing
  • Natural Language Processing
  • Neural networks
  • Multiclass classification
  • R

In the first two hours of the workshop, participants will develop prediction models. For each parliamentary intervention in the test set, they must predict to which of the 21 topic classes of the CAP codebook it belongs. Models will be scored on the percentage of oral questions that are correctly predicted (accuracy). In the last hour of the workshop, model performance will be compared and participants will present their models. A group discussion on best practices and the most common problems will wrap up the session. Four manually labelled country datasets in multiple languages will be used in the workshop.
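The scoring rule is simple: accuracy is the share of interventions whose CAP major topic is predicted correctly. A minimal sketch (assumed for illustration, not the organisers' own scoring code, with hypothetical topic codes):

```python
# Hypothetical CAP major-topic codes for six test-set interventions.
true_topics = [1, 3, 12, 12, 20, 6]   # gold-standard manual codes
pred_topics = [1, 3, 12, 14, 20, 6]   # a model's predictions

# Accuracy: fraction of interventions predicted correctly.
accuracy = sum(t == p for t, p in zip(true_topics, pred_topics)) / len(true_topics)
print(accuracy)  # 5 of 6 correct ≈ 0.833
```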

About the instructor

Camilo Cristancho is a Juan de la Cierva postdoctoral research fellow at the Universitat de Barcelona. He is a member of the research groups on the Quality of Democracy and on Democracy, Elections and Citizenship, where he studies political attitudes and behaviour. He works with computational social science as well as statistical and experimental methods.

C. Data Collection and Text Analysis in the Cloud Room 203

About the tutorial

In this workshop we will introduce the use of Virtual Machines both for collecting text data from social media sources and for analysing large quantities of text. We will define a pipeline that allows us to collect data reliably and securely without constant supervision by the user, making use of the tools provided by the cloud service.

The topics covered in the workshop will be:

  • Introduction to Virtual Machines and the Google Cloud Platform, which will include:
    • Creating our first Virtual Machine in Google Cloud Services.
    • Installing Python and MongoDB for data collection in our VM.
    • Creating a service that runs in the background for silent (and reliable) data collection from the Twitter streaming API.
    • Creating scheduled jobs for data collection from the Google News aggregator.
  • Text analytics in the cloud using Google Cloud NLP, which will cover:
    • Loading data from our VMs
    • Sentiment analysis in the cloud
    • Entity recognition in the cloud
    • Content analysis in the cloud
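A scheduled collection job of the kind described above is typically set up with cron on the VM. A sketch of a crontab entry (the script path and file names are placeholders, not part of the workshop materials):

```shell
# Run a (hypothetical) Google News collector script at the top of every hour,
# appending its output to a log file. Installed with: crontab -e
0 * * * * /usr/bin/python3 /home/user/collect_news.py >> /home/user/collect.log 2>&1
```

Because cron runs the job unattended, the collector keeps gathering data without the constant user supervision the pipeline is designed to avoid.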

To take full advantage of the workshop, participants must have a Google account and must activate their Google Cloud Services account (https://cloud.google.com) under the free trial scheme. Participants must also have a Twitter Developer account (https://developer.twitter.com/en/apply-for-access). Finally, prior knowledge of the Bash shell or any other terminal shell is recommended, as well as prior programming experience.

About the instructor

Daniel Valdenegro is a PhD student in Computational Social Science at the University of Leeds and a former Data Analyst and Junior Researcher at the Social Psychology Lab of the Pontifical Catholic University of Chile. He is passionate about data analysis and quantitative social research methodology, with experience working with R, Python, and JavaScript. His research interests lie in using “big data” from digital sources (such as social media, the IoT, or general digital footprints) to model human behaviour. His current PhD project uses the public social media footprint of populations undergoing periods of social unrest to extract their general emotional pattern, in order to build a predictive model of activism based on these parameters.

A-3, E & D (15:00-17:00)

E. Advanced Topic Modeling Room 202

About the tutorial

The workshop will introduce methods for automatically extracting the topics present in documents and deriving latent patterns that reflect the structure of a corpus. Topic modelling is an unsupervised approach for discovering the word sets (i.e. “topics”) in large collections of texts. Topic models are used for document clustering, organizing large blocks of textual data, extracting information from unstructured text, and feature selection.

After a brief introduction to concepts such as semantic similarity and the logic of topic modelling, participants will follow a hands-on approach, applying the method to their own corpus. Participants are encouraged to bring their own data, but a toy corpus will be provided for those who do not have any. The main technique covered in the workshop is LDA (Latent Dirichlet Allocation) with Mallet. Steps include: loading and preparing data, cleaning and pre-processing the texts, exploratory analysis, preparing the data for LDA analysis, training the LDA model, topic model diagnostics, and analysing/interpreting the LDA results. Other topic modelling techniques, such as Non-Negative Matrix Factorization and Latent Semantic Indexing, will also be briefly introduced for comparison.

Course Prerequisites: 1) Basic familiarity with RStudio and with data management and data visualization in R. 2) Basic familiarity with text analysis in R packages such as quanteda, tm, or the like would be useful but is not required. 3) Participants need to bring their own laptops with RStudio, tidyverse, quanteda, Java, and mallet installed, and with the environment set up for Java.

About the instructor

Ahmet Suerdem is a Professor in the Business Administration Department at Istanbul Bilgi University and a Senior Academic Visitor at the London School of Economics and Political Science. He has taught various courses on social network analysis, text analysis, and qualitative and quantitative methodology at different institutions. He is currently working on the international research project MACAS (mapping the cultural authority of science), developing routines for constructing a science news corpus from online sources and for the automatic analysis of thematic content. He is also experimenting with the operationalisation of higher-order text intuitions (at the semantic and discursive levels).

D. Text Mining and Machine Learning with Apache Spark Room 203

About the tutorial 

Apache Spark is currently one of the most popular open-source cluster-computing frameworks. With its Machine Learning Library (MLlib) it supports the easy scaling of a range of feature extraction and machine learning tasks commonly employed in text mining. Furthermore, it works with both Python and R.

The tutorial will first cover the basics of using an Apache Spark cluster for text mining and machine learning, and will then provide a walk-through of the text classification solution developed within the framework of the Hungarian leg of the Comparative Agendas Project – with the support of the MTA SZTAKI Cloud team – as a use case example of the possibilities opened up by the increased speed offered by parallel computing.

The tutorial will address among other things: a) configuring the Apache Spark cluster, b) using a Hadoop Distributed File System with the cluster, c) operating the cluster via an RStudio Server and sparklyr (the Spark interface for R developed by RStudio), and d) the differences in available functionality of the Machine Learning Library for sparklyr, SparkR (the R API developed by Apache Spark) and PySpark (the Python API for Spark).

About the instructor

Zoltan Kacsuk holds a doctoral degree from Kyoto Seika University. He is a postdoctoral researcher at the Japanese Visual Media Graph project, Institute for Applied Artificial Intelligence, Stuttgart Media University, and is also a part-time research fellow at the Department of Government and Public Policy, Institute for Political Science, HAS Centre for Social Sciences.