Principles of Text Analysis
Patrick van Kessel
Nov 18th, 2020
PDHP resumes our 2020 workshop series on Nov. 18th with a workshop entitled Principles of Text Analysis, presented by Patrick van Kessel, senior data scientist at Pew Research Center. This half-day workshop is geared toward data analysts who work with unstructured text data (e.g. open-ended survey responses or web-curated text). It provides a tutorial on cleaning, processing, and analyzing data from text-based sources using state-of-the-art text analytics techniques. Examples are primarily in Python, with some also provided in R; experience with either language is recommended but not required.
Topics include:
- Preprocessing and cleaning messy text data
- Feature extraction using TF-IDF vectorization
- Text analytics techniques including topic modelling and unsupervised clustering methods
- Software demonstration featuring the scikit-learn library for Python
Slides & Lab Materials:
Software:
- Demos for this workshop are conducted using Google’s Colab online Python notebook (no installs required):
https://colab.research.google.com/github/patrickvankessel/text-analysis-workshop/blob/main/Tutorial.ipynb
Those who wish to install the software and run the demos locally will need the following:
- Python version 3+ (required)
- Clone Patrick’s GitHub repository and install the Python libraries numpy, scipy, scikit-learn, pandas, matplotlib, nltk, statsmodels, and jupyter (all required; install commands below, requires pip)
git clone https://github.com/patrickvankessel/text-analysis-workshop.git
cd text-analysis-workshop
pip3 install -r requirements.txt
python3 -m jupyter notebook
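After installing, a quick check like the following can confirm that the listed libraries are importable before the workshop. This is a sketch only and is not part of the workshop repository: the `missing_packages` helper is hypothetical, and the mapping from pip package names to import names (for example, scikit-learn imports as `sklearn`) is stated here as an assumption.

```python
from importlib.util import find_spec

# pip package name -> module name used in `import` (assumed mapping).
REQUIRED = {
    "numpy": "numpy",
    "scipy": "scipy",
    "scikit-learn": "sklearn",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "nltk": "nltk",
    "statsmodels": "statsmodels",
    "jupyter": "jupyter",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

If anything is reported missing, re-running the `pip3 install -r requirements.txt` step from inside the cloned repository is the first thing to try.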