skip to content

A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics

Presented by: 
Flori Bunea Cornell University
Monday 25th June 2018 - 16:00 to 16:45
INI Seminar Room 1
Topic models have become popular for the analysis of data that consists in a collection of n independent multinomial observations. The model links all cell probabilities, collected in a p x n matrix, via the assumption that this matrix can be factorized as the product of two non-negative matrices A and W. Topic models have been originally developed in text mining, when one browses through n documents, based on a dictionary of p words, and covering K topics. In this terminology, the matrix A is called the word-topic matrix, and is the main target of estimation. It can be viewed as a matrix of conditional probabilities, and it is uniquely defined, under appropriate separability assumptions, discussed in this talk. Notably, the unique A is required to satisfy what is commonly known as the anchor word assumption, under which A has an unknown number of rows respectively proportional to K-dimensional canonical basis vectors. The indices of such rows are referred to as anchor words. Recent computationally feasible algorithms, with theoretical guarantees, utilize constructively this assumption by linking the estimation of the set of anchor words with that of estimating the K vertices of a simplex. This crucial step in the estimation of A requires K to be known, and cannot be easily extended to the more realistic set-up when K is unknown. In this talk, we take a different view on anchor word estimation, and on the estimation of the word-topic matrix A. We propose a new method of estimation in topic models, that is not a variation on the existing simplex finding algorithms, and that estimates K from the observed data. We derive new finite sample minimax lower bounds for the estimation of A, as well as new upper bounds for our proposed estimator. We describe the scenarios where our estimator is minimax adaptive. Our finite sample analysis is valid for any n, p, K and document length N, and both p and K are allowed to increase with n, a situation not handled well by previous analyses. We illustrate, via simulations, that the new algorithm is faster and more accurate than the current ones, although we start out with a computational and theoretical disadvantage of not knowing the correct number of topics K, while we provide the competing methods with the correct value in our simulations.
The video for this talk should appear here if JavaScript is enabled.
If it doesn't, something may have gone wrong with our embedded player.
We'll get it fixed as soon as possible.
University of Cambridge Research Councils UK
    Clay Mathematics Institute London Mathematical Society NM Rothschild and Sons