Coresets for scalable Bayesian logistic regression

Presented by: Tamara Broderick, Massachusetts Institute of Technology
Tuesday 4th July 2017 - 13:30 to 14:15
INI Seminar Room 1
Co-authors: Jonathan H. Huggins (MIT), Trevor Campbell (MIT)

The use of Bayesian methods in large-scale data settings is attractive because of the rich hierarchical models, uncertainty quantification, and prior specification they provide. However, standard Bayesian inference algorithms are computationally expensive, so their direct application to large datasets can be difficult or infeasible. Rather than modify existing algorithms, we exploit the fact that data is often redundant by adding a pre-processing step. In particular, we construct a weighted subset of the data (called a coreset) that is much smaller than the original dataset. We then pass this small coreset to existing posterior inference algorithms without modification. To demonstrate the feasibility of this approach, we develop an efficient coreset construction algorithm for Bayesian logistic regression models. We provide theoretical guarantees on the size and approximation quality of the coreset -- both for fixed, known datasets, and in expectation for a wide class of data generative models. Our approach permits efficient construction of the coreset in both streaming and parallel settings, with minimal additional effort. We demonstrate the efficacy of our approach on a number of synthetic and real-world datasets, and find that, in practice, the size of the coreset is independent of the original dataset size.
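To make the idea of a weighted subset concrete, here is a minimal Python sketch of a generic importance-sampling coreset for the logistic regression log-likelihood. It is an illustration under stated assumptions, not the construction presented in the talk: the sampling probabilities are supplied by the caller (uniform in the usage below), and the names `build_coreset` and `weighted_log_likelihood` are hypothetical. Labels are assumed to take values in {-1, +1}.

```python
import math
import random
from collections import Counter


def log_sigmoid(z):
    # Numerically stable log(1 / (1 + exp(-z))).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def build_coreset(n_points, probs, n_draws, rng):
    """Importance-sample indices and return a dict {index: weight}.

    Assumption: `probs` are caller-supplied, nonnegative sampling scores.
    Each draw of index i adds 1 / (n_draws * p_i) to its weight, so the
    weighted log-likelihood is an unbiased estimate of the full one.
    """
    total = sum(probs)
    p = [q / total for q in probs]  # normalise scores to probabilities
    draws = rng.choices(range(n_points), weights=p, k=n_draws)
    counts = Counter(draws)
    return {i: c / (n_draws * p[i]) for i, c in counts.items()}


def weighted_log_likelihood(theta, xs, ys, weights):
    """Sum over i of w_i * log sigmoid(y_i * <x_i, theta>)."""
    total = 0.0
    for i, w in weights.items():
        margin = ys[i] * sum(a * b for a, b in zip(xs[i], theta))
        total += w * log_sigmoid(margin)
    return total


# Usage with synthetic data (illustrative only).
rng = random.Random(0)
n = 1000
xs = [(1.0, rng.gauss(0.0, 1.0)) for _ in range(n)]
ys = [1 if rng.random() < 0.5 else -1 for _ in range(n)]

coreset = build_coreset(n, [1.0] * n, 50, rng)  # at most 50 distinct points
full = {i: 1.0 for i in range(n)}               # full data, unit weights

theta = (0.3, -0.7)
approx = weighted_log_likelihood(theta, xs, ys, coreset)
exact = weighted_log_likelihood(theta, xs, ys, full)
```

Because the coreset is just a set of indices with weights, any inference routine that accepts per-observation weights in the likelihood can consume it unchanged. Uniform probabilities are the simplest choice; choosing probabilities that reflect how much each point can influence the likelihood is what separates a careful coreset construction from plain subsampling.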