Isaac Newton Institute for Mathematical Sciences

Neural Networks and Machine Learning

July - December 1997

Organisers: C M Bishop (Aston), D Haussler (UCSC), G E Hinton (Toronto), M Niranjan (Cambridge), L G Valiant (Harvard)

A Newton Institute Themed Week

ON-LINE LEARNING IN NEURAL NETWORKS

17 - 21 November, 1997

Isaac Newton Institute, Cambridge, U.K.

Organisers: D Saad (Aston) and C M Bishop (Microsoft)

Abstracts

Chris Van den Broeck
Off-line Versus On-line Learning in an Unsupervised Scenario

We study both on-line and off-line learning in the following unsupervised learning scheme: p patterns are sampled independently from a distribution on a N-dimensional sphere with a single symmetry breaking orientation. Exact results are obtained in the limit $p \rightarrow \infty$ and N \rightarrow \infty$ with finite ratio p/N. One finds that for smooth pattern distributions, the asymptotic behavior of the optimal off-line and on-line learning are identical, and saturate the Cramer-Rao inequality from statistics. For discontinuous pattern distributions on the other hand, the optimal on-line algorithm needs (at least) twice as many examples asymptotically to reach the optimal off-line performance. When the average overlap of the patterns along the symmetry breaking direction is zero, one observes retarded learning in off-line and a learning plateau in on-line. Since the off-line result is the best that can be achieved, we conclude that the plateau has an minimal intrinsic length. We discuss the relevance of this results with respect to the problem of plateaus in the multi-layer networks.


Leon Bottou (AT&T, USA)
Stochastic Approximations and Online Learning

This talk shows how the stochastic approximations concepts can be applied to online learning. Examples will be given for both supervised and unsupervised online learning algorithms. I will then discuss the convergence of online learning algorithms and give results that apply to very broad classes of algorithms. During this discussion, I will address several practical issues, like the definition of the learning rate schedule, the desirable properties of the objective function, and the search for a "good local minimum".


Todd Leen
1. A Perturbation Expansion for Weight Space Probability Densities
(with Genevieve Orr)

The small noise expansion of the Master equation for stochastic gradient algorithms is often truncated at lowest order to obtain an approximation for the weight space probability density. We construct a perturbation series for the density that provides a description beyond this lowest-order (Gaussian) approximation. The series develops the density in terms of Hermite polynomials times Gaussians, thus recalling the Gram-Charlier expansion of classical statistics. We give examples of asymptotic densities and moments for fixed gain learning.

2. Ensemble Dynamics of Stochastic Approximation: An Exactly Solvable Model
(with John E. Moody)

Theoretical descriptions of the ensemble dynamics of stochastic optimization rely on truncated expansions of the master equation, or numerical solution to the time evolution in the case of the order parameter approach. Even for very simple examples, such as the linear LMS algorithm of adaptive signal processing, the time evolution cannot be predicted exactly. We have found that a variant of stochastic descent that uses the sign of the gradient, called Manhattan learning, gives rise to a master equation that can be integrated in closed form, thus providing an exactly solvable model.


Noboru Murata (RIKEN, Japan)
Statistical Analysis of On-line Learning

To clarify the ability of on-line learning, we apply statistical method and study the cases with fixed and 1/t^a -type annealing learning rate. In the case that the learning rate is annealed as 1/t, it can be shown for the asymptotic regime that on-line learning can minimize the generalization error with the same rate as batch learning. Based on this result and extending Sompolinsky-Barkai-Seung's algorithm, we propose an adaptive learning rate algorithm. Experimental results will be given by Klaus R. Mueller.


Klaus-Robert Müller (GMD FIRST, Berlin), Noboru Murata, Andreas Ziehe and Shun-ichi Amari
Applying On-line Learning to Non-stationary Blind Separation

In blind separation problems it often happens that the rule changes over time, either abruptly as a switch or slower as a drift. In this scenario on-line learning provides very useful techniques to adapt to the changing environment. Experimental results on the application of on-line learning to non-stationary blind separation problems are presented. Some problems in the course of using on-line learning in real world applications are discussed.


Siegfried Boes and Shun-ich Amari (RIKEN, Japan)
The Role of the Learning Rate in On-line Learning

The choice of the learning rate is of central importance in on-line learning. A time-independent, fixed learning rate is in most cases only a compromise. When learning starts it should be large enough to get a fast convergence, later it should be small enough to reach the optimum. A time-dependent, annealed learning rate seems to be the solution. However, in multi-layer networks the symmetric phase can not be avoided by the use of an annealed learning rate. And in the case of nonstationarities, an adaptive learning rate is preferable. In the talk these problems will be discussed and for several models (committee machine, nonlinear and linear perceptron) solutions will be presented.


Magnus Rattray and David Saad (Aston University, UK)
Optimal On-line Learning for Multilayer Neural Networks

We consider the problem of determining optimal parameters and, more generally, optimal rules for on-line learning in multilayer feed-forward neural networks. This work employs a recently developed statistical mechanics description of the learning process which is exact for large input dimension. We minimize the total change in generalization error over the entire learning process using a variational calculation, leading to optimal conditions for learning with respect to the relevant parameter (or rule). We show how the consideration of global effects is critical, since optimizing the rate of change in generalization error does not lead to good performance in general.


David Saad and Magnus Rattray (Aston University, UK)
Incorporating Curvature Information in On-Line Learning

Several approaches have been suggested over the years for incorporating curvature information in on-line learning scenarios for speeding up training. Among the methods suggested are the natural gradient and the matrix momentum methods; the former requires inverting the Hessian matrix while the latter avoids an explicit inversion and can be used with efficient single step estimates. We employ the order parameter approach to investigate the learning dynamics in the case of multilayer networks and to examine the efficacy of these methods for the various stages of learning. The results suggest that neither of these methods is very useful during the symmetric phase and that matrix momentum, a local cost-effective alternative to global methods, is rather sensitive to the choice of training parameters.


Manfred Opper (Aston University, UK)
A Bayesian Approach to Online Learning

The exact computation of the posterior distribution in Bayesian inference usually requires the storage of all data. Replacing the exact posterior by distributions which depend on a small number of parameters (like Gaussians) allows an approximate update of the posterior in an online fashion. For smooth models, when the number of observations is large, asymptotically no information is lost in the online algorithm and efficient estimation can be achieved.


Sara A. Solla (Northwestern and CONNECT) and Ole Winther (CONNECT, The Niels Bohr Institute)
Optimal Bayesian On-line Learning

In a Bayesian approach to online learning, a simple parametric approximate form for the posterior distribution over weight space is updated after the presentation of each example. Predictions on new data follow from averages over this parametric posterior. This approach is to be compared to the batch or offline Bayesian optimal approach, based on an accurate calculation of the posterior distribution in terms of the prior and the likelihood of the full training set. We conjecture that the performance of the online algorithm can be optimized by minimizing at each time step the Kullback-Leibler distance between the parametric distribution and the corresponding batch distribution. The implications of this approach to online Bayesian learning are investigated for some simple learning scenarios.


Luis B. Almeida (IST and INESC, Portugal)
Adapting Parameters in On-line Learning

Optimization is an essential operation in many domains. A typical case is the training of neural networks and other similar models. Most optimization methods involve parameters that have to be chosen a priori or adapted during the procedure itself. An example is gradient descent/ascent, which involves one or more step size parameters. These parameters can be fixed, but they can also be made adaptive, usually resulting in significant gains in speed and robustness.

Parameter adaptation procedures typically have been developed only for batch-mode (deterministic) optimization methods. In this talk a derivation will be made of a parameter adaptation procedure for on-line (stochastic) optimization methods. The class of optimization methods that is involved is relatively wide, and includes on-line gradient optimization as a special case. The class of functions that can be optimized is rather wide: no locally quadratic approximation is necessary.

Examples will be given of the application of the procedure to the optimization of relatively simple benchmark functions, and also to step size adaptation in the training of multilayer perceptrons with on-line backpropagation. The results put into evidence the gains in speed that can be obtained.


Enno Schlösser and Michael Biehl (Würzburg, Germany)
The Dynamics of On-line Principal Component Analysis

The learning dynamics of an on--line algorithm for principal component analysis is exactly described in the thermodynamic limit by means of coupled ordinary differential equations for a set of order parameters. It is demonstrated that learning can is delayed significantly due to the necessary breaking of symmetries among the students. The same effect can lead to a perfect or partial loss of initial knowledge in the course of learning. The analysis shows that the use of


Mauro Copelli (Limburgs, Belgium), M. Biehl, N. Caticha, O. Kinouchi, R. Eichorn, R. Simonetti and P. Riegler
Noise Robustness in Boolean Machines

Within the Statistical Mechanics approach, supervised online learning of a boolean function is studied. We restrict ourselves to committee and parity machines with tree-like structure, in a scenario where both the 'student' (hypothesis) and the 'teacher' (rule) have the same architecture. In the thermodynamic limit a variational method suggests an algorithm which maximizes the student generalization ability. Besides giving upper bounds for the performance of such machines, these optimal algorithms can eventually be implemented if some quantities it requires can be estimated (e.g. the generalization error itself, or the noise level in the examples). We focus on the consequences of misestimation of the noise level in the system and present results in terms of noise robustness phase diagrams. In the space of "true" vs. "estimated" noise level, the optimal algorithm has a large robust phase where the rule can be perfectly inferred in the limit of large number of examples. However, a worse misestimation may lead to partial or total loss of generalization.


Yoshiyuki Kabashima (Tokyo Institute of Technology, Japan)
Learning a Decision Boundary from Stochastic Examples: Incremental Algorithms with and without Queries

In this seminar, I will talk about on-line learning of a simple learning problem in which a decision boundary is estimated from stochastic examples. For this problem, two learning algorithms which work with and without queries are proposed and their learning performance is investigated analytically.


Tom Heskes (Nijmegen, The Netherlands)
On-line learning with correlated patterns

I discuss two approaches to study learning with correlations between successive patterns. The stochastic description is useful for networks with a finite number of inputs and a relatively small learning parameter. Especially if there are plateaus in the energy landscape, the effect of correlations can be quite dramatic. The statistical mechanics framework treats networks with a huge number of inputs. Computing the effect of correlations within this framework appears to be surprisingly simple. The effect itself, however, is much smaller than in the case of a finite number of inputs.


David Barber (Nijmegen, The Netherlands) and Peter Sollich (Edinburgh, UK)
Online Learning from Finite Training Sets: Soft Committee Machine

Online learning is one of the most common forms of neural network training. We present an analysis of online learning from finite training sets for non-linear networks (namely, soft-committee machines), advancing the theory to more realistic learning scenarios. Dynamical equations are derived for an appropriate set of order parameters; these are exact in the limiting case of either linear networks or infinite training sets. Preliminary comparisons with simulations suggest that the theory captures some effects of finite training sets, but may not yet account correctly for the presence of local minima.


ACC Coolen and David Saad
Theory of On-Line Learning with Restricted Training Sets

We develop a macrocopic theory to describe supervised on-line learning in neural networks with restricted training sets. Here the pool of training examples is of size p=alpha.N with finite alpha. This has the following consequences: (i) the field distribution P(x,y) is no longer Gaussian (ii) the training- and generalization errors are not identical (iii) the standard formalism (developed for alpha=infinity) with two dynamic order parameters (Q,R) is no longer applicable

We use dynamical replica theory to solve the finite alpha case, involving a dynamical order parameter which is a function and a saddle-point problem to be solved at each time-step. For simplicity and transparency we restrict ourselves here to simple binary perceptrons and noise-free teachers (generalization to multi-layer systems and noisy teachers is in principle straightforward and ongoing). We demonstrate how the simple (Q,R) theory is incorporated as a special case for alpha->infinity, we show how for simple learning rules (such as the Hebbian rule) the saddle- point problem can be solved analytically, and we present some preliminary comparisons between theory and numerical simulations.


This workshop will form a component of the Newton Institute programme on Neural Networks and Machine Learning, organised by C M Bishop, D Haussler, G E Hinton, M Niranjan and L G Valiant.

Copyright © Isaac Newton Institute