Transposably invariant sample reuse: the pigeonhole bootstrap and blockwise cross-validation
Seminar Room 1, Newton Institute
Sample reuse methods like the bootstrap and cross-validation are widely used in statistics and machine learning. They provide measures of accuracy with face-value validity that does not depend on strong model assumptions.
These methods work by repeating or omitting cases while keeping all the variables in those cases. But for many data sets it is not obvious whether the rows are cases and the columns are variables, or vice versa. For example, with movie ratings organized by movie and customer, both movie and customer IDs can be thought of as variables.
This talk looks at bootstrap and cross-validation methods that treat rows and columns of the matrix symmetrically. We get the same answer on X as on X'. McCullagh has proved that no exact bootstrap exists in a certain framework of this type (crossed random effects). We show that a method based on resampling both rows and columns of the data matrix tracks the true error, for some simple statistics applied to large data matrices.
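The row-and-column resampling idea can be sketched in a few lines of NumPy. This is an illustrative implementation, not the exact procedure from the paper: we draw row indices and column indices independently with replacement, form the resampled matrix, and evaluate a simple statistic (here the grand mean) on each replicate. The function name and defaults are my own.

```python
import numpy as np

def pigeonhole_bootstrap(X, stat, n_boot=200, seed=0):
    """Resample rows and columns of X independently with replacement,
    evaluating `stat` on each doubly-resampled matrix.

    A sketch of the idea only; the paper's estimator may differ in detail.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    reps = np.empty(n_boot)
    for b in range(n_boot):
        rows = rng.integers(0, n, size=n)  # row IDs drawn with replacement
        cols = rng.integers(0, m, size=m)  # column IDs drawn with replacement
        reps[b] = stat(X[np.ix_(rows, cols)])
    return reps

# Toy example: bootstrap standard error of the grand mean
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 40))
reps = pigeonhole_bootstrap(X, np.mean)
se = reps.std(ddof=1)
```

Note that resampling both axes is symmetric in rows and columns: applying the same procedure to X' with the roles of n and m swapped gives the same distribution of replicates.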
Similarly we look at a method of cross-validation that leaves out blocks of the data matrix, generalizing a proposal due to Gabriel that is used in the crop science literature. We find empirically that this approach provides a good way to choose the number of terms in a truncated SVD model or a non-negative matrix factorization. We also apply some recent results in random matrix theory to the truncated SVD case.
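A minimal sketch of the block hold-out idea for a truncated SVD, in the Gabriel style: partition X as [[A, B], [C, D]] by holding out a set of rows and columns, fit a rank-k SVD to the retained block D, and predict the held-out block via A ≈ B (D_k)⁺ C, where (D_k)⁺ is the pseudo-inverse of the rank-k truncation of D. The function below and the test matrix are my own illustration under these assumptions, not the paper's code.

```python
import numpy as np

def bcv_error(X, k, row_holdout, col_holdout):
    """Held-out squared error at rank k for one (row, column) block.

    Partition X = [[A, B], [C, D]] and predict A by B @ pinv(D_k) @ C,
    where D_k is the rank-k truncated SVD of the retained block D.
    """
    rmask = np.zeros(X.shape[0], dtype=bool); rmask[row_holdout] = True
    cmask = np.zeros(X.shape[1], dtype=bool); cmask[col_holdout] = True
    A = X[np.ix_(rmask, cmask)]    # held-out block to be predicted
    B = X[np.ix_(rmask, ~cmask)]
    C = X[np.ix_(~rmask, cmask)]
    D = X[np.ix_(~rmask, ~cmask)]  # retained block used for fitting
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    # Pseudo-inverse of the rank-k truncation of D
    Dk_pinv = (Vt[:k].T / s[:k]) @ U[:, :k].T
    return np.mean((A - B @ Dk_pinv @ C) ** 2)

# Toy example: a rank-2 signal plus noise; the held-out error should
# drop sharply once k reaches the true rank.
rng = np.random.default_rng(2)
signal = 10 * rng.normal(size=(60, 2)) @ rng.normal(size=(2, 50))
X = signal + rng.normal(size=(60, 50))
errs = [bcv_error(X, k, range(15), range(12)) for k in range(5)]
```

One would repeat this over all blocks of a row-and-column partition and average the errors, then choose the rank k that minimizes the average.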
- http://stat.stanford.edu/~owen/reports - Page with research articles
- http://stat.stanford.edu/~owen/reports/cvsvd.pdf - Technical report on bi-cross-validation (in revision)
- http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&page=toc&handle=euclid.aoas/1196438015 - AOAS page with link to Pigeonhole bootstrap paper