1.1 Machine Learning

Many physical processes may be described by the following nonlinear, multivariate mapping:

\[ y_t = f(x_t) + n_t \qquad (1.1) \]

where \(x_t\) denotes the inputs, \(y_t\) the outputs (targets), \(n_t\) an unknown noise process, and \(t\) is an index variable over the data. Depending on how the data is gathered, we can identify two types of learning: batch learning and sequential learning. In the context of batch learning, the learning problem involves computing an approximation to the function \(f\) and estimating the characteristics of the noise process given a set of \(N\) input-output observations:

\[ \{x_1, y_1,\; x_2, y_2,\; \ldots,\; x_N, y_N\}. \]
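For illustration only, the following sketch generates data from the model in (1.1) and contrasts batch access with sequential, one-at-a-time access. The choice of \(f\) and of Gaussian noise here is an arbitrary assumption made for the example, not something prescribed by the text.

```python
import numpy as np

# Toy illustration of the data model y_t = f(x_t) + n_t (Equation 1.1).
rng = np.random.default_rng(42)

def f(x):
    # Some unknown nonlinear mapping (illustrative choice).
    return np.sin(3 * x) + 0.5 * x

N = 200
x = rng.uniform(-2, 2, size=N)          # inputs x_t
noise = rng.normal(scale=0.1, size=N)   # measurement noise n_t
y = f(x) + noise                        # observed targets y_t

# Batch learning: the full set {x_1, y_1, ..., x_N, y_N} is available at once.
batch = list(zip(x, y))

# Sequential learning: observations arrive one at a time.
for x_t, y_t in zip(x, y):
    pass  # update the model estimate here, one observation per step
```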

In contrast, in the sequential learning scenario, the observations arrive one at a time. Typical instances of the learning problem include regression, where the output \(y\) is continuous; classification, where \(y\) corresponds to a discrete group of classes; and nonlinear dynamical system identification, where the inputs and targets correspond to several delayed versions of the signals under consideration. The disturbances \(n_t\) may represent both measurement noise and unknown inputs.
This study will assume that they can be added directly to the output. The basis for
this assumption is that noise in the input together with other system disturbances will
propagate through the system and therefore can be lumped into one single measurement noise term (Ljung, 1987). When introducing sequential Monte Carlo methods in
Chapter 6, this assumption will be weakened by adopting a more general formulation, in which the noise is not restricted to enter additively at the output. In some scenarios one might also be interested in modelling the distribution of the input data \(p(x)\) (Cornford et al., 1998; Wright, 1998); this topic, however, lies beyond the scope of this thesis.
The goal of learning, as posed here, is to obtain a description of the conditional distribution \(p(y \mid x)\). As the dimension of this distribution can be very large, it is convenient to adopt a variational approach and project it into a lower-dimensional space. This can be accomplished by introducing a set of parameters \(\theta\), leading to the distribution \(p(y \mid x, \theta)\). For example, if we believe that the data have been generated by a Gaussian distribution, we only need two sufficient statistics to describe it, namely its mean and covariance. These statistics can, in turn, be described by a low-dimensional set of parameters. These parameters will allow us to infer the outputs \(y\) whenever we observe new values of the inputs \(x\).
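As a worked restatement of the Gaussian example, the projection onto a parametric family can be written as follows; the symbols \(\mu_\theta\) and \(\Sigma_\theta\) are illustrative placeholders rather than the thesis's own notation:

\[
p(y \mid x) \;\approx\; p(y \mid x, \theta), \qquad
p(y \mid x, \theta) = \mathcal{N}\!\big(y;\, \mu_\theta(x),\, \Sigma_\theta(x)\big),
\]

so that learning reduces to estimating the low-dimensional parameter set \(\theta\) governing the mean and covariance.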
The regression function \(f(\cdot)\) of the conditional distribution \(p(y \mid x)\) is a multivariate, nonlinear and possibly time-varying mapping.¹

¹ The notation \(y_{1:N,j}\) is adopted to denote all the observations corresponding to the \(j\)-th output. To simplify the notation, \(y_j\) is equivalent to \(y_{1:N,j}\); that is, if one index does not appear, it is implied that we are referring to all of its possible values. Similarly, \(y\) is equivalent to \(y_{1:N,1:c}\), where \(c\) denotes the number of outputs. The shorter notation will be favoured; the longer notation will only be invoked to avoid ambiguities and to emphasise certain dependencies.

When the exact nonlinear structure of this mapping cannot be established a priori, it may be synthesised as a combination of parametrised basis functions. That is:

\[ f(x_t) \;=\; \sum_{j=1}^{K} \beta_j\, \phi_j(x_t) \qquad (1.2) \]

where \(\phi_j(\cdot)\) denotes a multivariate basis function. These multivariate basis
functions may be generated from univariate basis functions using radial basis, tensor product or ridge construction methods. This type of modelling is often referred to
as “non-parametric” regression because the number of basis functions is typically very
large. Equation (1.2) encompasses a large number of nonlinear estimation methods
including projection pursuit regression (Friedman and Stuetzle, 1981; Huber, 1985),
Volterra series (Billings, 1980; Mathews, 1991), fuzzy inference systems (Jang and Sun,
1993), generalised linear models (Nelder and Wedderburn, 1972), multivariate adaptive regression splines (MARS) (Denison, 1998; Friedman, 1991) and many artificial
neural network paradigms such as functional link networks (Pao, 1989), multi-layer
perceptrons (MLPs) (Rosenblatt, 1959; Rumelhart et al., 1986), radial basis function
networks (RBFs) (Lowe, 1989; Moody and Darken, 1988; Poggio and Girosi, 1990),
wavelet networks (Bakshi and Stephanopoulos, 1993; Juditsky et al., 1995) and hinging hyper-planes (Breiman, 1993). For an introduction to neural networks, the reader
may consult any of the following books (Bishop, 1995b; Haykin, 1994; Hecht-Nielsen,
1990; Ripley, 1996).
Neural networks can approximate any continuous function on a compact domain arbitrarily well as the
number of neurons (basis functions) increases without bound (Cybenko, 1989; Hornik
et al., 1989; Poggio and Girosi, 1990). In addition, they have been successfully applied
to many complex problems, including speech recognition (Robinson, 1994), handwritten digit recognition (Le Cun et al., 1989), financial modelling (Refenes, 1995) and
medical diagnosis (Baxt, 1990) among others. This thesis will consider two types of
neural network architectures: fixed dimension MLPs and variable dimension RBFs.
MLPs have enjoyed a privileged position in the neural networks community because
of their simplicity, approximating power, relation to biological systems and various historical reasons. Figure 1.2 shows a typical two hidden layer MLP with logistic sigmoid
basis functions in the hidden layers and a single output linear neuron. Networks of this
type can be represented mathematically as follows:

\[ \hat{y}(x_t) \;=\; b^{(3)} + \sum_{j} w^{(3)}_{j}\, \sigma\!\Big( b^{(2)}_{j} + \sum_{i} w^{(2)}_{ji}\, \sigma\big( b^{(1)}_{i} + w^{(1)\,\mathsf T}_{i} x_t \big) \Big) \]

where \(\sigma(\cdot)\) denotes the logistic sigmoid, the superscripts index the layers, \(w^{(1)}_{i}\) is the vector containing the weights connecting each input with the \(i\)-th neuron of the first hidden layer, and the \(b\) terms denote biases. The logistic sigmoid function is given by:

\[ \sigma(z) \;=\; \frac{1}{1 + e^{-z}}. \]
Figure 1.2 Typical multi-layer perceptron architecture.

If our goal is to perform classification, then it is convenient to employ a logistic function in the output layer. This allows us to interpret the outputs of the network as
probabilities of class membership. Although the MLPs discussed in this thesis exhibit
a feed-forward architecture, they can be easily extended to recurrent schemes by the
addition of multiple feedback connections or tapped delay lines (de Freitas et al., 1996;
Narendra and Parthasarathy, 1990; Puskorius and Feldkamp, 1994; Qin et al., 1992;
Sjöberg, 1995; Yamada and Yabuta, 1993).
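To make the architecture concrete, here is a minimal sketch of the forward pass of a two-hidden-layer MLP with logistic sigmoid hidden units and a single linear output neuron. The layer sizes, weight names and random initialisation are illustrative assumptions, not taken from the text.

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid, applied element-wise.
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2, w_out, b_out):
    """Forward pass of a two-hidden-layer MLP with a single linear output.

    x     : (d,) input vector
    W1    : (h1, d) first hidden-layer weights,   b1    : (h1,)
    W2    : (h2, h1) second hidden-layer weights, b2    : (h2,)
    w_out : (h2,) output weights,                 b_out : scalar bias
    """
    a1 = sigmoid(W1 @ x + b1)     # first hidden-layer activations
    a2 = sigmoid(W2 @ a1 + b2)    # second hidden-layer activations
    return w_out @ a2 + b_out     # linear output neuron

# Example usage with random weights (illustrative only).
rng = np.random.default_rng(0)
d, h1, h2 = 3, 5, 4
params = (rng.normal(size=(h1, d)), rng.normal(size=h1),
          rng.normal(size=(h2, h1)), rng.normal(size=h2),
          rng.normal(size=h2), 0.0)
y_hat = mlp_forward(rng.normal(size=d), *params)
```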
RBF networks tend to be more tractable than MLPs. In these models, the training
of the parameters corresponding to different layers is, to a large extent, decoupled.
Chapters 5 and 7 will discuss an approximation scheme consisting of a mixture of \(k\) RBFs and a linear regression term (Holmes and Mallick, 1998). The number of basis functions will be estimated from the data. Thus, unless the data provide evidence of nonlinearity, the model collapses to a standard linear model. More precisely, the linear-RBF model \(\mathcal{M}_k\) is given by:

\[ \hat{y}(x_t) \;=\; \sum_{j=1}^{k} a_j\, \phi\big(\lVert x_t - \mu_j \rVert\big) \;+\; \beta_0 \;+\; \beta_1^{\mathsf T} x_t \]
where, in this case, \(\lVert \cdot \rVert\) denotes a distance metric (usually Euclidean or Mahalanobis), \(\mu_j\) denotes the \(j\)-th RBF centre for a model with \(k\) RBFs, \(a_j\) denotes the \(j\)-th RBF amplitude, and \(\beta_0\) and \(\beta_1\) denote the linear regression parameters. Figure 1.3 depicts the approximation model for two inputs, two outputs and three basis functions.

Figure 1.3 Linear-RBF approximation model with three radial basis functions, two inputs and two outputs. The solid lines indicate weighted connections.

Depending on our a priori knowledge about the smoothness of the mapping,
we can choose different types of basis functions (Girosi et al., 1995). The most common
choices are:

- Linear: \(\phi(r) = r\)
- Cubic: \(\phi(r) = r^{3}\)
- Thin-plate spline: \(\phi(r) = r^{2} \ln r\)
- Multi-quadric: \(\phi(r) = (r^{2} + \lambda^{2})^{1/2}\)
- Gaussian: \(\phi(r) = \exp(-\lambda r^{2})\)

where \(r = \lVert x_t - \mu_j \rVert\). For the last two choices of basis functions, \(\lambda\) will be treated as a user-set parameter.
Nevertheless, the Monte Carlo estimation strategies described in Chapters 5, 6 and 7
can treat the choice of basis functions as a model selection problem. It is possible to
place a prior distribution on the basis functions and allow the Monte Carlo algorithms
to decide which of them provide a better solution.
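As a concrete illustration, the following is a minimal sketch of evaluating a linear-RBF model of the kind described above, assuming Gaussian basis functions and a Euclidean distance metric; the centres, amplitudes and width parameter are placeholder values, not estimates obtained from data.

```python
import numpy as np

def linear_rbf_predict(X, centres, amplitudes, beta0, beta1, lam=1.0):
    """Evaluate a linear-RBF model: a sum of Gaussian RBFs plus a linear term.

    X          : (n, d) inputs
    centres    : (k, d) RBF centres mu_j
    amplitudes : (k,)   RBF amplitudes a_j
    beta0      : scalar intercept of the linear regression term
    beta1      : (d,)   linear regression coefficients
    lam        : Gaussian width parameter (user-set)
    """
    # Squared Euclidean distances between every input and every centre: (n, k)
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    phi = np.exp(-lam * d2)                    # Gaussian basis responses
    return phi @ amplitudes + beta0 + X @ beta1

# Example usage with three centres in two dimensions (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))
centres = rng.normal(size=(3, 2))
y_hat = linear_rbf_predict(X, centres, amplitudes=np.ones(3),
                           beta0=0.0, beta1=np.array([0.5, -0.2]))
```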

1.2 Scope and Contributions of this Thesis
In the example presented earlier, it was shown that the ability of a model to predict accurately with novel data depends on the amount of data, the complexity of the model
and the noise in the data. It was then argued that artificial neural networks provide a
general and flexible nonlinear modelling strategy. From this standpoint, the learning
problem involves estimating the neural network’s parameters, the number of parameters, the type of basis functions and the statistics of the noise. In addition, we might
have to select the most appropriate set of input signals.
A great deal of effort has been devoted to solving the parameter estimation problem, while the remaining problems have received far less attention. The issues of noise estimation and model selection are, by contrast, central to the scope of this thesis. These more demanding tasks become manageable by embracing the Bayesian learning paradigm. Although the problems of input variable selection and basis function selection are not treated explicitly, their solution is a natural extension of the model selection frameworks presented in Chapters 6 and 7.
Another important theme in this thesis is the issue of sequential learning and inference. Sequential training methods for neural networks are important in many applications involving real-time signal processing, where data arrival is inherently sequential.
Furthermore, one might wish to adopt a sequential processing strategy to deal with
non-stationarity in signals, so that information from the recent past is given greater
weight than information from the distant past. Computational simplicity in the form
of not having to store all the data might also constitute an additional motivating factor
for sequential methods.
This thesis proposes the following:

- A novel approach to perform regularisation in sequential learning. This approach establishes theoretical links between extended Kalman filters with adaptive noise estimation, gradient descent methods with multiple adaptive learning rates and training methods with multiple smoothing regularisation coefficients.

- An expectation maximisation (EM) algorithm to estimate jointly the parameters of an MLP, the noise statistics and the model uncertainty. The method is applicable to non-stationary parameter spaces.

- A robust Bayesian method to estimate jointly the parameters, number of parameters, noise statistics and signal-to-noise ratios of an RBF network. The necessary computations are performed using a reversible jump Markov chain Monte Carlo (MCMC) simulation method. In addition, an efficient reversible jump MCMC simulated annealing strategy is presented to perform global optimisation of RBF networks, and the convergence of these algorithms is proved rigorously.²

- The application of particle filters and sequential Monte Carlo (SMC) methods to the neural networks field. In doing so, new SMC algorithms are devised to deal with the high dimensional parameter spaces inherent to neural network models. These algorithms are suitable for nonlinear, non-Gaussian and non-stationary modelling.

- A new and general sequential Monte Carlo approach to perform sequential noise estimation and model selection. The method is demonstrated on RBF networks.

² These contributions were strongly motivated by the work of Christophe Andrieu and Arnaud Doucet (Andrieu and Doucet, 1998b; Andrieu and Doucet, 1998a; Andrieu and Doucet, 1999).
1.3 Thesis Organisation
The chapters are, to a large extent, self-contained and can be read independently. Chapter 2 is of an introductory nature. At the end of each of the main chapters, Chapters
3 to 7, the proposed methods and algorithms are demonstrated on experiments with
synthetic data. In Chapter 8, further tests on a few real problems are presented. A
summary of the thesis follows:
Chapter 2: Learning and Generalisation
This chapter provides a brief review of learning theory from a neural networks perspective. It addresses both the classical and Bayesian approaches. In addition, it introduces
the sequential learning problem.
Chapter 3: Sequential Bayesian Learning with Gaussian Approximations
Sequential learning methods, in particular Gaussian approximation schemes, are introduced in this chapter. It is shown that a hierarchical Bayesian modelling approach enables one to perform regularisation in sequential learning. Three inference levels are identified within this hierarchy, namely model selection, parameter estimation and noise estimation. In environments where data arrives sequentially, techniques such as cross-validation for regularisation or model selection are not feasible. The Bayesian approach, with extended Kalman filtering at the parameter estimation level, allows for regularisation within a minimum variance framework. A multi-layer perceptron is used to generate the extended Kalman filter's nonlinear measurement mapping. Several algorithms are described at the noise estimation level, thus permitting the implementation of on-line regularisation. Another contribution of this chapter is to show
the theoretical links between adaptive noise estimation in extended Kalman filtering,
multiple adaptive learning rates and multiple smoothing regularisation coefficients.
Chapter 4: Dynamic Batch Learning with the EM Algorithm
This chapter extends the sequential Gaussian approximation framework discussed in
the previous chapter to the batch learning scenario. In it, an EM algorithm for nonlinear
state space models is derived. It is used to estimate jointly the neural network weights,
the model uncertainty and the noise in the data. In the E-step, a forward-backward
Rauch-Tung-Striebel smoother is adopted to compute the network weights. For the
M-step, analytical expressions are derived to compute the model uncertainty and the
measurement noise. The method is shown to be intrinsically very powerful, simple and
stable.
Chapter 5: Robust Full Bayesian Learning with MCMC
This chapter begins the presentation of Monte Carlo methods, a major theme in this
thesis. The reversible jump MCMC simulation algorithm is applied to RBF networks, so
as to compute the joint posterior distribution of the radial basis centres and the number
of basis functions. This area of research is advanced in three important directions.
First, a robust prior for RBF networks is proposed. That is, the results do not depend
on any heuristics or thresholds. Second, an automated growing and pruning reversible
jump MCMC optimisation algorithm is designed to choose the model order according
to classical AIC, BIC and MDL criteria. This MCMC algorithm estimates the maximum
of the joint likelihood function of the radial basis centres and the number of bases using
simulated annealing. Finally, some geometric convergence theorems for the proposed
algorithms are presented.
Chapter 6: Sequential Monte Carlo Methods
Here, a novel strategy for training neural networks using sequential Monte Carlo (SMC)
algorithms is discussed. Various hybrid gradient descent/sampling importance resampling algorithms are proposed. In terms of both modelling flexibility and accuracy, SMC
algorithms provide a clear improvement over conventional Gaussian schemes. These
algorithms may be viewed as a global learning strategy to learn the probability distributions of the network weights and outputs in a sequential framework. They are also well
suited to applications involving on-line, nonlinear and non-Gaussian signal processing.
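To fix ideas about the kind of update such methods perform, the following is a textbook sketch of a single generic sampling-importance-resampling step; it is not the specific hybrid gradient-descent/SIR algorithms proposed in Chapter 6, and the `transition` and `likelihood` functions are user-supplied placeholders.

```python
import numpy as np

def sir_step(particles, weights, transition, likelihood, y_t, rng):
    """One generic sampling-importance-resampling (SIR) step.

    particles  : (P, dim) current particle set (e.g. network weights)
    weights    : (P,) normalised importance weights
    transition : function proposing new particles from old ones
    likelihood : function returning p(y_t | particle) for each particle
    """
    # 1. Propagate each particle through the transition (proposal) model.
    particles = transition(particles, rng)
    # 2. Reweight by the likelihood of the new observation.
    weights = weights * likelihood(y_t, particles)
    weights /= weights.sum()
    # 3. Resample to counter weight degeneracy, then reset to uniform weights.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```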

Chapter 7: Sequential Bayesian Model Selection
This chapter extends the model selection strategy discussed in Chapter 5 to the sequential learning case. This problem does not usually admit closed-form analytical solutions and, as a result, one has to resort to numerical methods. The chapter proposes an original sequential simulation-based strategy to perform the necessary
computations. It combines sequential importance sampling, a selection procedure and
reversible jump MCMC moves. The effectiveness of the method is demonstrated by
applying it to radial basis function networks.
Chapter 8: Applications
This chapter demonstrates the performance of the various methods on some interesting real data sets. It includes comprehensive comparisons between the proposed algorithms.
Chapter 9: Conclusions
This final chapter summarises the theoretical and experimental results. It discusses
their relevance and suggests a few directions for further research.