An Introduction to Entity Resolution
(PDHP Record Linkage Workshop Series Part 1 – July 10th, 2019)
Part 1 of a multi-part workshop series on record linkage
The PDHP workshop series resumes July 10th with the first in a multi-part series of workshops on record linkage topics & techniques within social research. Please join Assistant Professor Rebecca C. Steorts, PhD, of Duke University’s Department of Statistical Science, as she presents An Introduction to Entity Resolution, a half-day workshop geared toward population researchers, computational social scientists, statisticians, and data scientists of all experience levels. This hands-on workshop will cover both the theory and practice of probabilistic entity resolution, while demonstrating state-of-the-art techniques using R software and Apache Spark.
Topics include:
- Overview and introduction to entity resolution
- Entity resolution fundamentals (record linkage, de-duplication, blocking, and computational gains)
- Entity resolution evaluation metrics (including precision, reduction ratio, and robustness to tuning parameters)
- Bayesian entity resolution models (including both parametric and nonparametric Bayesian mixture models)
- Hands-on demonstration of state-of-the-art R packages (using blink) and computational gains (using Apache Spark)
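To make two of the listed ideas concrete, blocking and the evaluation metrics, the following is a minimal sketch in base R. It is not taken from the workshop materials: the toy records, the blocking key (first letter of last name plus birth year), and all variable names are assumptions chosen for illustration only.

```r
# Toy records (hypothetical): ids 1 and 2 are the same person, as are 4 and 5.
records <- data.frame(
  id        = 1:6,
  last_name = c("smith", "smyth", "sanders", "meyer", "meyer", "baker"),
  birth_yr  = c(1980, 1980, 1980, 1990, 1990, 1962),
  entity    = c("a", "a", "b", "c", "c", "d"),  # ground truth, used only for evaluation
  stringsAsFactors = FALSE
)

# Without blocking, every unordered pair is compared: n * (n - 1) / 2 pairs.
all_pairs <- t(combn(records$id, 2))
n_all <- nrow(all_pairs)

# Illustrative blocking key: first letter of last name plus birth year.
block_key <- paste0(substr(records$last_name, 1, 1), records$birth_yr)

# Keep only the pairs whose two records fall in the same block.
same_block <- block_key[all_pairs[, 1]] == block_key[all_pairs[, 2]]
candidate_pairs <- all_pairs[same_block, , drop = FALSE]

# Reduction ratio: fraction of pairwise comparisons that blocking avoids.
reduction_ratio <- 1 - nrow(candidate_pairs) / n_all

# Precision if every candidate pair were declared a match.
is_true_match <- records$entity[candidate_pairs[, 1]] ==
  records$entity[candidate_pairs[, 2]]
precision <- sum(is_true_match) / nrow(candidate_pairs)
```

In this toy case, blocking cuts 15 possible comparisons down to 4 candidate pairs, of which 2 are true matches. A coarser key would reduce fewer comparisons, while a stricter one risks separating true matches into different blocks.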
Software:
Demos for this workshop are run in R and require several R packages plus a data package hosted on GitHub.
- Install R (required)
R packages and example data can be installed using the following code:
## install packages
install.packages(c("devtools", "RecordLinkage", "blink", "knitr", "ggplot2", "igraph", "textreuse", "tokenizers", "numbers"))
## install data package
devtools::install_git("https://github.com/resteorts/RLdata")
Workshop Slides & Materials:
The 4-hour PDHP workshop is a shortened version of a fuller short course that is available online. Sections presented live at PDHP by Dr. Steorts are denoted with “(Michigan)”.