Principles of Text Analysis

Patrick van Kessel

Nov 18th, 2020


PDHP resumes our 2020 workshop series on Nov. 18th with a workshop entitled Principles of Text Analysis, presented by Patrick van Kessel, senior data scientist at Pew Research Center. This half-day workshop is geared toward data analysts who work with unstructured text data (e.g., open-ended survey responses or web-curated text). It provides a tutorial on cleaning, processing, and analyzing data from text-based sources with state-of-the-art text analytics techniques. Most of the material uses Python, with some examples also provided in R; experience with either language is recommended but not required.

Topics include:

  • Preprocessing and cleaning messy text data
  • Feature extraction using TF-IDF vectorization
  • Text analytics techniques including topic modelling and unsupervised clustering methods
  • Software demonstration featuring the scikit-learn library for Python

Slides & Lab Materials:

Slides are available here


Demos for this workshop are conducted using Google’s Colab online Python notebook (no installs required).

Those who wish to install the software and run the demos locally will need the following:
Python version 3+ (required)

Clone Patrick’s GitHub repository and install the Python libraries numpy, scipy, scikit-learn, pandas, matplotlib, nltk, statsmodels, and jupyter (all required; install code below, requires pip)

git clone
cd text-analysis-workshop
pip3 install -r requirements.txt
python3 -m jupyter notebook