Deduplicating databases of deaths in war: advances in adaptive blocking, pairwise classification, and clustering

Presented by: 
Patrick Ball, Human Rights Data Analysis Group
Monday 12th September 2016 - 12:00 to 12:30
INI Seminar Room 1
Violent inter-state and civil wars are documented with lists of casualties, and each list constitutes a partial, non-probability sample of the universe of deaths. There are often several lists, with duplicate entries both within each list and across lists, so record linkage is required to deduplicate them and produce a unique enumeration of the known dead.

This talk will explore how we do record linkage, including: new advances in generating and learning from training data; an adaptive blocking approach; pairwise classification with string, date, and integer features and several classifiers; and a hybrid clustering method. Assessment metrics will be proposed for each stage, with real-world results from deduplicating more than 420,000 records of Syrian people killed since 2011.
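
The stages named above (blocking, pairwise classification, clustering) can be illustrated with a minimal sketch. This is not the method presented in the talk. It assumes a fixed blocking key in place of adaptive blocking, hand-set thresholds in place of a trained classifier, and connected components (union-find) in place of the hybrid clustering method. The records, field layout, and function names are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from datetime import date
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records: (id, name, date of death, age). Not real data.
RECORDS = [
    (0, "ahmad khalil", date(2012, 3, 4), 31),
    (1, "ahmed khalil", date(2012, 3, 5), 31),
    (2, "ahmad khaleel", date(2012, 3, 4), 30),
    (3, "sara haddad", date(2013, 7, 19), 24),
    (4, "sarah haddad", date(2013, 7, 19), 24),
    (5, "omar nasser", date(2012, 3, 4), 45),
]


def block_key(rec):
    # Fixed blocking rule (an assumption): first letter of the name plus
    # year of death. Only records sharing a key are ever compared.
    _, name, died, _ = rec
    return (name[0], died.year)


def candidate_pairs(records):
    # Group records by blocking key, then enumerate pairs within each block.
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    for members in blocks.values():
        yield from combinations(members, 2)


def features(a, b):
    # One string, one date, and one integer feature per pair.
    name_sim = SequenceMatcher(None, a[1], b[1]).ratio()
    day_gap = abs((a[2] - b[2]).days)
    age_gap = abs(a[3] - b[3])
    return name_sim, day_gap, age_gap


def is_match(a, b):
    # Hand-set thresholds standing in for a trained pairwise classifier.
    name_sim, day_gap, age_gap = features(a, b)
    return name_sim >= 0.8 and day_gap <= 2 and age_gap <= 2


def cluster(records):
    # Union-find over matched pairs: each connected component is treated
    # as one person.
    parent = {rec[0]: rec[0] for rec in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in candidate_pairs(records):
        if is_match(a, b):
            parent[find(a[0])] = find(b[0])

    groups = defaultdict(list)
    for rec in records:
        groups[find(rec[0])].append(rec[0])
    return sorted(sorted(g) for g in groups.values())


clusters = cluster(RECORDS)
print(clusters)
```

Blocking here cuts the 15 possible pairs among six records down to 4 candidate pairs, which is the reason for having a blocking stage at all: on hundreds of thousands of records, comparing every pair is impractical. Connected components is the simplest clustering choice, and it can chain together records that are only loosely similar, which is one motivation for more careful clustering methods.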

