Machine Learning 3

In the previous chapter it was hinted, by means of a simple example, that in order to obtain a good representation of the process being modelled, one needs to estimate the model complexity, the parameters and the noise characteristics. In addition, it was mentioned that it is beneficial to incorporate a priori knowledge so as to mitigate the ill-conditioned nature of the learning problem. If we follow these prescriptions, we can almost assuredly obtain a model that generalises well.
This chapter will briefly review the classical approaches to learning and generalisation in the neural networks field. Aside from regularisation with noise and committees of estimators, most of the standard methods fall into two broadly overlapping categories: penalised likelihood and predictive assessment methods. Penalised likelihood methods involve placing a penalty term either on the model dimension or on the smoothness of the response (Hinton, 1987; Le Cun et al., 1990; Poggio and Girosi, 1990). Predictive assessment strategies, such as cross-validation, the jackknife or the bootstrap (Ripley, 1996; Stone, 1974; Stone, 1978; Wahba and Wold, 1969), typically entail dividing the training data set into several distinct subsets. The model is subsequently trained using all but one of the subsets and its performance is validated on the omitted subset. The procedure is repeated for each of the subsets. This predictive assessment is often used to set the penalty parameters in the penalised likelihood formulations.
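As a point of reference, here is a minimal sketch of the predictive assessment procedure just described: the training data are split into several subsets, the model is trained on all but one of them and validated on the omitted one, and the procedure is repeated over all subsets. Only numpy is assumed; fit_model and predict are hypothetical placeholders for whatever estimator is being assessed.

```python
import numpy as np

def cross_validate(x, y, fit_model, predict, n_folds=5, seed=0):
    """Average validation error over n_folds train/validate splits."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))            # shuffle the observation indices
    folds = np.array_split(idx, n_folds)     # partition into distinct subsets
    errors = []
    for k in range(n_folds):
        val = folds[k]                        # the omitted subset
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = fit_model(x[train], y[train])
        errors.append(np.mean((predict(model, x[val]) - y[val]) ** 2))
    return float(np.mean(errors))

# Hypothetical usage with a quadratic polynomial fitted by least squares:
# score = cross_validate(x, y,
#                        fit_model=lambda xs, ys: np.polyfit(xs, ys, 2),
#                        predict=np.polyval)
```

The resulting score could, for instance, be used to choose a penalty coefficient in a penalised likelihood formulation.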
These methods tend to lack a general and rigorous framework for incorporating a priori knowledge into the modelling process. Furthermore, they do not provide suitable foundations for the study of generalisation in sequential learning. To surmount
these limitations, the Bayesian learning paradigm will be adopted in this thesis. This
approach will allow us to incorporate
a priori knowledge into the modelling process
and to compute, jointly and within a probabilistic framework, the model parameters,
noise characteristics, model structure and regularisation coefficients. It will also allow
us to do this sequentially.

Machine Learning 2

Many physical processes may be described by a nonlinear, multivariate input-output mapping. Depending on how the data is gathered, we can identify two types of learning: batch learning and sequential learning. In the context of batch learning, the learning problem involves computing an approximation to this mapping and estimating the characteristics of the noise process given a set of N input-output observations:

y_t = f(x_t) + n_t,        t = 1, 2, ..., N,

where x_t denotes the input vector, y_t the measured output and n_t an additive disturbance.
In contrast, in the sequential learning scenario, the observations arrive one at a time.
Typical instances of the learning problem include regression, where the output y is continuous; classification, where y corresponds to one of a discrete group of classes; and nonlinear dynamical system identification, where the inputs and targets correspond to several delayed versions of the signals under consideration.
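To make the two data-gathering regimes concrete, the short sketch below draws noisy observations from an assumed mapping y = f(x) + n and contrasts batch access (all observations available at once) with sequential access (one observation at a time). The particular f, the Gaussian noise and the numerical values are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(3 * x)            # assumed "true" mapping, for illustration
N = 100
x = rng.uniform(-1.0, 1.0, size=N)     # inputs
y = f(x) + 0.1 * rng.normal(size=N)    # outputs corrupted by additive noise

# Batch learning: the full set of N input-output pairs is available at once.
batch_data = list(zip(x, y))

# Sequential learning: observations arrive one at a time and estimates are
# updated as they arrive; here a running mean of the outputs stands in for
# a proper on-line learning algorithm.
running_mean = 0.0
for t, (x_t, y_t) in enumerate(zip(x, y), start=1):
    running_mean += (y_t - running_mean) / t
```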
The disturbances n_t may represent both measurement noise and unknown inputs. This study will assume that they can be added directly to the output. The basis for this assumption is that noise in the input, together with other system disturbances, will propagate through the system and can therefore be lumped into a single measurement noise term (Ljung, 1987). When introducing sequential Monte Carlo methods in Chapter 6, this assumption will be weakened by adopting a more general formulation. In some scenarios, one might, however, be interested in modelling the distribution of the input data p(x) (Cornford et al., 1998; Wright, 1998). That topic lies beyond the scope of this thesis.
The goal of learning, as posed here, is to obtain a description of the conditional distribution p(y | x). As the dimension of this distribution can be very large, it is convenient to adopt a variational approach and project it into a lower dimensional space. This can be accomplished by introducing a set of parameters θ, leading to the distribution p(y | x, θ). For example, if we believe that the data has been generated by a Gaussian distribution, we only need two sufficient statistics to describe it, namely its mean and covariance. These statistics can, in turn, be described by a low-dimensional set of parameters. These parameters will allow us to infer the outputs y whenever we observe new values of the inputs x.
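As a concrete illustration of projecting a distribution onto a small set of parameters, the sketch below estimates the mean and covariance of data assumed to be Gaussian; those two sufficient statistics then play the role of the parameters θ. The numbers are arbitrary and serve only the example.

```python
import numpy as np

rng = np.random.default_rng(2)
# Pretend these are observed outputs, assumed (for illustration) to be Gaussian.
samples = rng.multivariate_normal(mean=[0.0, 1.0],
                                  cov=[[1.0, 0.3], [0.3, 0.5]], size=500)

# Under the Gaussian assumption, these two statistics fully describe the
# distribution: the "parameters" are simply the mean vector and covariance.
mean_hat = samples.mean(axis=0)
cov_hat = np.cov(samples, rowvar=False)
```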
The regression function f(x) is a multivariate, nonlinear and possibly time-varying mapping. (A note on notation: if an index does not appear, it is implied that we are referring to all of its possible values; for example, a symbol without an observation index denotes all the observations of the corresponding output. The shorter notation will be favoured; the longer notation will only be invoked to avoid ambiguities and to emphasise certain dependencies.) When the exact nonlinear structure of this mapping
cannot be established
a priori, it may be synthesised as a combination of parametrised
basis functions. That is,

f(x) = Σ_{j=1}^{M} a_j φ_j(x),        (1.2)

where φ_j(x) denotes a multivariate basis function and a_j the corresponding coefficient. These multivariate basis functions may be generated from univariate basis functions using radial basis, tensor product or ridge construction methods. This type of modelling is often referred to as “non-parametric” regression because the number of basis functions is typically very large. Equation (1.2) encompasses a large number of nonlinear estimation methods
including projection pursuit regression (Friedman and Stuetzle, 1981; Huber, 1985),
Volterra series (Billings, 1980; Mathews, 1991), fuzzy inference systems (Jang and Sun,
1993), generalised linear models (Nelder and Wedderburn, 1972), multivariate adaptive regression splines (MARS) (Denison, 1998; Friedman, 1991) and many artificial
neural network paradigms such as functional link networks (Pao, 1989), multi-layer
perceptrons (MLPs) (Rosenblatt, 1959; Rumelhart et al., 1986), radial basis function
networks (RBFs) (Lowe, 1989; Moody and Darken, 1988; Poggio and Girosi, 1990),
wavelet networks (Bakshi and Stephanopoulos, 1993; Juditsky et al., 1995) and hinging hyper-planes (Breiman, 1993). For an introduction to neural networks, the reader
may consult any of the following books (Bishop, 1995b; Haykin, 1994; Hecht-Nielsen,
1990; Ripley, 1996).
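To fix ideas, the sketch below evaluates a basis-function expansion of the form of Equation (1.2), with Gaussian radial basis functions as one possible construction; the centres, amplitudes and width are arbitrary illustrative values rather than quantities estimated from data.

```python
import numpy as np

def basis_expansion(x, centres, amplitudes, width=0.5):
    """Evaluate f(x) = sum_j a_j * phi_j(x) with Gaussian radial bases."""
    x = np.atleast_2d(x)                          # shape (n_points, d)
    diffs = x[:, None, :] - centres[None, :, :]   # shape (n_points, M, d)
    phi = np.exp(-np.sum(diffs ** 2, axis=-1) / (2.0 * width ** 2))
    return phi @ amplitudes                       # shape (n_points,)

centres = np.array([[-0.5, 0.0], [0.5, 0.5], [0.0, -0.5]])   # M = 3 bases
amplitudes = np.array([1.0, -2.0, 0.5])
print(basis_expansion([[0.1, 0.2]], centres, amplitudes))
```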
Neural networks can approximate any continuous function arbitrarily well as the
number of neurons (basis functions) increases without bound (Cybenko, 1989; Hornik
et al., 1989; Poggio and Girosi, 1990). In addition, they have been successfully applied
to many complex problems, including speech recognition (Robinson, 1994), handwritten digit recognition (Le Cun et al., 1989), financial modelling (Refenes, 1995) and
medical diagnosis (Baxt, 1990) among others. This thesis will consider two types of
neural network architectures: fixed dimension MLPs and variable dimension RBFs.
MLPs have enjoyed a privileged position in the neural networks community because
of their simplicity, approximating power, relation to biological systems and various historical reasons. Figure 1.2 shows a typical two hidden layer MLP with logistic sigmoid
basis functions in the hidden layers and a single output linear neuron. Networks of this
type can be represented mathematically as follows:

ŷ(x) = w^(3) · σ( W^(2) σ( W^(1) x + b^(1) ) + b^(2) ) + b^(3),

where σ(·) is applied element-wise and the j-th row of W^(1) is the vector containing the weights connecting each input with the j-th neuron of the first hidden layer; W^(2), w^(3) and the biases b^(1), b^(2), b^(3) are defined analogously for the second hidden layer and the linear output neuron. The logistic sigmoid
function is given by:

σ(a) = 1 / (1 + exp(-a)).
Figure 1.2 Typical multi-layer perceptron architecture.

If our goal is to perform classification, then it is convenient to employ a logistic function in the output layer. This allows us to interpret the outputs of the network as
probabilities of class membership. Although the MLPs discussed in this thesis exhibit
a feed-forward architecture, they can be easily extended to recurrent schemes by the
addition of multiple feedback connections or tapped delay lines (de Freitas et al., 1996;
Narendra and Parthasarathy, 1990; Puskorius and Feldkamp, 1994; Qin et al., 1992;
Sjöberg, 1995; Yamada and Yabuta, 1993).
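The sketch below spells out the forward pass of a two-hidden-layer MLP of the kind shown in Figure 1.2, with logistic sigmoid neurons in the hidden layers and a single linear output neuron. The layer sizes and the randomly drawn weights are illustrative assumptions only.

```python
import numpy as np

def sigmoid(a):
    # Logistic sigmoid used as the hidden-layer basis function.
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, params):
    """y_hat = w3 . sigmoid(W2 sigmoid(W1 x + b1) + b2) + b3."""
    W1, b1, W2, b2, w3, b3 = params
    h1 = sigmoid(W1 @ x + b1)    # first hidden layer
    h2 = sigmoid(W2 @ h1 + b2)   # second hidden layer
    return w3 @ h2 + b3          # single linear output neuron

rng = np.random.default_rng(0)
d, h = 3, 5                      # input dimension and hidden-layer width
params = (rng.normal(size=(h, d)), np.zeros(h),
          rng.normal(size=(h, h)), np.zeros(h),
          rng.normal(size=h), 0.0)
print(mlp_forward(rng.normal(size=d), params))
```

For classification, replacing the linear output with a logistic (or softmax) output would let the network outputs be read as probabilities of class membership, as noted above.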
RBF networks tend to be more tractable than MLPs. In these models, the training
of the parameters corresponding to different layers is, to a large extent, decoupled.
Chapters 5 and 7 will discuss an approximation scheme consisting of a mixture of k RBFs and a linear regression term (Holmes and Mallick, 1998). The number of basis functions will be estimated from the data. Thus, unless the data is nonlinear, the model collapses to a standard linear model. More precisely, the linear-RBF model is given by:

f(x) = Σ_{j=1}^{k} a_j φ( ||x - μ_j|| ) + b_1^T x + b_0,
where, in this case, ||·|| denotes a distance metric (usually Euclidean or Mahalanobis), μ_j denotes the j-th RBF centre for a model with k RBFs, a_j denotes the j-th RBF amplitude, and b_0 and b_1 denote the linear regression parameters. Figure 1.3 depicts the approximation model for the case of two inputs, two outputs and three radial basis functions.

Figure 1.3 Linear-RBF approximation model with three radial basis functions, two inputs and two
outputs. The solid lines indicate weighted connections.
Depending on our a priori knowledge about the smoothness of the mapping, we can choose different types of basis functions (Girosi et al., 1995). The most common choices include thin-plate spline, multi-quadric and Gaussian functions. For basis functions that include a width parameter, such as the Gaussian and multi-quadric, this width will be treated as a user-set parameter.
Nevertheless, the Monte Carlo estimation strategies described in Chapters 5, 6 and 7
can treat the choice of basis functions as a model selection problem. It is possible to
place a prior distribution on the basis functions and allow the Monte Carlo algorithms
to decide which of them provide a better solution.
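As a minimal sketch of the linear-RBF model described above, the function below combines k radial basis functions with a linear regression term. The Gaussian basis, the Euclidean metric and all numerical values are illustrative choices, not the quantities estimated by the algorithms of Chapters 5 and 7.

```python
import numpy as np

def linear_rbf(x, centres, amplitudes, b1, b0, width=1.0):
    """f(x) = sum_j a_j * phi(||x - mu_j||) + b1 . x + b0, with Gaussian phi."""
    dists = np.linalg.norm(x - centres, axis=1)        # Euclidean distance metric
    phi = np.exp(-(dists ** 2) / (2.0 * width ** 2))   # one common basis choice
    return amplitudes @ phi + b1 @ x + b0

centres = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])   # k = 3 centres
out = linear_rbf(np.array([0.2, -0.3]), centres,
                 amplitudes=np.array([1.0, 0.5, -0.2]),
                 b1=np.array([0.1, -0.4]), b0=0.05)
print(out)
```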

1.2 Scope and Contributions of this Thesis
In the example presented earlier, it was shown that the ability of a model to predict accurately with novel data depends on the amount of data, the complexity of the model
and the noise in the data. It was then argued that artificial neural networks provide a
general and flexible nonlinear modelling strategy. From this standpoint, the learning
problem involves estimating the neural network’s parameters, the number of parameters, the type of basis functions and the statistics of the noise. In addition, we might
have to select the most appropriate set of input signals.
A great deal of effort has been devoted to the parameter estimation problem, while the other problems have received less attention. In this thesis, by contrast, the issues of noise estimation and model selection will be central. It will be possible to manage these more demanding tasks by embracing the Bayesian learning paradigm. Although the problems of input variable selection and basis function selection are not treated explicitly, their solution is a natural extension of the model selection frameworks presented in Chapters 6 and 7.
Another important theme in this thesis is the issue of sequential learning and inference. Sequential training methods for neural networks are important in many applications involving real-time signal processing, where data arrival is inherently sequential.
Furthermore, one might wish to adopt a sequential processing strategy to deal with
non-stationarity in signals, so that information from the recent past is given greater
weight than information from the distant past. Computational simplicity in the form
of not having to store all the data might also constitute an additional motivating factor
for sequential methods.
This thesis proposes the following:

- A novel approach to perform regularisation in sequential learning. This approach establishes theoretical links between extended Kalman filters with adaptive noise estimation, gradient descent methods with multiple adaptive learning rates and training methods with multiple smoothing regularisation coefficients.

- An expectation maximisation (EM) algorithm to estimate the parameters of an MLP, the noise statistics and the model uncertainty jointly. The method is applicable to non-stationary parameter spaces.

- A robust Bayesian method to estimate, jointly, the parameters, number of parameters, noise statistics and signal-to-noise ratios of an RBF network. The necessary computations are performed using a reversible jump Markov chain Monte Carlo (MCMC) simulation method. In addition, the thesis presents an efficient reversible jump MCMC simulated annealing strategy to perform global optimisation of RBF networks. Furthermore, it proves the convergence of these algorithms rigorously.²

- The application of particle filters and sequential Monte Carlo (SMC) methods to the neural networks field. In doing so, new SMC algorithms are devised to deal with the high dimensional parameter spaces inherent to neural network models. These algorithms are suitable for nonlinear, non-Gaussian and non-stationary modelling.

- A new and general sequential Monte Carlo approach to perform sequential noise estimation and model selection. The method is demonstrated on RBF networks.

² These contributions were strongly motivated by the work of Christophe Andrieu and Arnaud Doucet (Andrieu and Doucet, 1998b; Andrieu and Doucet, 1998a; Andrieu and Doucet, 1999).
1.3 Thesis Organisation
The chapters are, to a large extent, self-contained and can be read independently. Chapter 2 is of an introductory nature. At the end of each of the main chapters, Chapters
3 to 7, the proposed methods and algorithms are demonstrated on experiments with
synthetic data. In Chapter 8, further tests on a few real problems are presented. A
summary of the thesis follows:
Chapter 2: Learning and Generalisation
This chapter provides a brief review of learning theory from a neural networks perspective. It addresses both the classical and Bayesian approaches. In addition, it introduces
the sequential learning problem.
Chapter 3: Sequential Bayesian Learning with Gaussian Approximations
Sequential learning methods, in particular Gaussian approximation schemes, are introduced in this chapter. It is shown that an hierarchical Bayesian modelling approach
enables one to perform regularisation in sequential learning. Three inference levels
are identified within this hierarchy, namely model selection, parameter estimation and
noise estimation. In environments where data arrives sequentially, techniques such
as cross-validation to achieve regularisation or model selection are not possible. The
Bayesian approach, with extended Kalman filtering at the parameter estimation level,
allows for regularisation within a minimum variance framework. A multi-layer perceptron is used as the extended Kalman filter's nonlinear measurement mapping.
Several algorithms are described at the noise estimation level, thus permitting the implementation of on-line regularisation. Another contribution of this chapter is to show the theoretical links between adaptive noise estimation in extended Kalman filtering, multiple adaptive learning rates and multiple smoothing regularisation coefficients.
Chapter 4: Dynamic Batch Learning with the EM Algorithm
This chapter extends the sequential Gaussian approximation framework discussed in
the previous chapter to the batch learning scenario. In it, an EM algorithm for nonlinear
state space models is derived. It is used to estimate jointly the neural network weights,
the model uncertainty and the noise in the data. In the E-step, a forward-backward
Rauch-Tung-Striebel smoother is adopted to compute the network weights. For the
M-step, analytical expressions are derived to compute the model uncertainty and the
measurement noise. The method is shown to be intrinsically very powerful, simple and
stable.
Chapter 5: Robust Full Bayesian Learning with MCMC
This chapter begins the presentation of Monte Carlo methods, a major theme in this
thesis. The reversible jump MCMC simulation algorithm is applied to RBF networks, so
as to compute the joint posterior distribution of the radial basis centres and the number
of basis functions. This area of research is advanced in three important directions.
First, a robust prior for RBF networks is proposed. That is, the results do not depend
on any heuristics or thresholds. Second, an automated growing and pruning reversible
jump MCMC optimisation algorithm is designed to choose the model order according
to classical AIC, BIC and MDL criteria. This MCMC algorithm estimates the maximum
of the joint likelihood function of the radial basis centres and the number of bases using
simulated annealing. Finally, some geometric convergence theorems for the proposed
algorithms are presented.
Chapter 6: Sequential Monte Carlo Methods
Here, a novel strategy for training neural networks using sequential Monte Carlo (SMC)
algorithms is discussed. Various hybrid gradient descent/sampling importance resampling algorithms are proposed. In terms of both modelling flexibility and accuracy, SMC
algorithms provide a clear improvement over conventional Gaussian schemes. These
algorithms may be viewed as a global learning strategy to learn the probability distributions of the network weights and outputs in a sequential framework. They are also well
suited to applications involving on-line, nonlinear and non-Gaussian signal processing.

Chapter 7: Sequential Bayesian Model Selection
This chapter extends the model selection strategy discussed in Chapter 5 to the sequential learning case. This problem does not usually admit any type of closed-form
analytical solutions and, as a result, one has to resort to numerical methods. The chapter proposes an original sequential simulation-based strategy to perform the necessary
computations. It combines sequential importance sampling, a selection procedure and
reversible jump MCMC moves. The effectiveness of the method is demonstrated by
applying it to radial basis function networks.
Chapter 8: Applications
This chapter demonstrates the performance of the various methods on some interesting real data sets. It includes comprehensive comparisons between the proposed algorithms.
Chapter 9: Conclusions
This final chapter summarises the theoretical and experimental results. It discusses
their relevance and suggests a few directions for further research.

Machine Learning 1

Models are abstractions of reality to which experiments can be applied to improve
our understanding of phenomena in the world. They are at the heart of science and
permeate most disciplines of human endeavour, including economics, engineering, medicine, politics, sociology and data management in general. Models can
be used to process data to predict future events or to organise data in ways that allow
information to be extracted from it.
There are two common approaches to constructing models. The first is of a deductive nature. It relies on subdividing the system being modelled into subsystems that
can be expressed by accepted relationships and physical laws. These subsystems are
typically arranged in the form of simulation blocks and sets of differential equations.
The model is consequently obtained by combining all the sub-models.
The second approach favours the inductive strategy of estimating models from measured data. This estimation process will be referred to as “learning from data” or simply “learning” for short. In this context, learning implies finding patterns in the data
or obtaining a parsimonious representation of data that can then be used for several
purposes such as forecasting or classification. Learning is of paramount importance
in nonlinear modelling due to the lack of a coherent and comprehensive theory for
nonlinear systems.
Learning from data is an ill-posed problem. That is, there is a continuum of solutions for any particular data set. Consequently, certain restrictions have to be imposed
on the form of the solution. Often
a priori knowledge about the phenomenon being
modelled is available in the form of physical laws, generally accepted heuristics or
mathematical relations. This knowledge should be incorporated into the modelling
process so as to reduce the set of possible solutions to one that provides reasonable results. The ill-posed nature and other inherent difficulties associated with the problem

of learning can be clearly illustrated by means of a simple noisy interpolation example.
Consider the data plotted in Figure 1.1-A. It has been generated by the following
equation:

y = f(x) + n,
where f(x) represents the true function between the input x and the output y, and n denotes zero-mean, uniformly distributed noise. The learning task is to use the noisy
data points plotted in Figure 1.1-A to estimate the true relation between the input and
the output.
We could attempt to model the data by polynomials of different order fitted to the
data by conventional least squares techniques. Let us assume that we try to fit second
and sixth order polynomials to the data. As shown in Figure 1.1-B, the 6th order
polynomial approximates the data better than the second order polynomial. However,
if we plot the true function and the two estimators as in Figure 1.1-C, we find that the
second order estimator provides a better approximation to the true function. Moreover,
the second order estimator provides a far better approximation to the true function for
novel (extrapolation) data, as depicted in Figure 1.1-D.
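The effect can be reproduced in a few lines: fit second and sixth order polynomials by least squares to noisy samples of an assumed true function, then compare their errors on the training data and on novel (extrapolation) inputs. The true function, noise level and intervals below are illustrative assumptions and do not reproduce the data of Figure 1.1.

```python
import numpy as np

rng = np.random.default_rng(42)
f = lambda x: x * np.sin(2 * x)                     # assumed true function
x_train = np.linspace(-2.0, 2.0, 15)
y_train = f(x_train) + rng.uniform(-0.3, 0.3, size=x_train.shape)  # noisy data

x_test = np.linspace(2.0, 3.0, 50)                  # novel (extrapolation) inputs
for order in (2, 6):
    coeffs = np.polyfit(x_train, y_train, deg=order)          # least-squares fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - f(x_test)) ** 2)
    print(f"order {order}: training MSE {train_mse:.3f}, "
          f"extrapolation MSE {test_mse:.3f}")
```

Typically, the higher order fit achieves the lower training error while its extrapolation error is far larger, mirroring the behaviour described above.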
In conclusion, very complex estimators will approximate the training data points
better but may be worse estimators of the true function. Consequently, their predictions
for samples not encountered in the training data set may be worse than the predictions
produced by lower complexity estimators. The ability to predict well with samples
not encountered in the training data set is usually referred to as generalisation in the
machine learning literature. Note that if we had known the attributes of the noise
term
a priori, we could have inferred that the 6th order polynomial was fitting it.
Alternatively, if we had had any data in the extrapolation region beyond the training interval, we would have noticed
the problems associated with using the 6th order polynomial. The last two remarks
indicate, clearly, that
a priori knowledge and the size and scope of the data set play a
significant role in learning.
The previous simple example has unveiled several of the difficulties that arise when we try to infer models from noisy data, namely:

- The learning problem is ill-posed. It contains infinitely many solutions.
- Noise and limited training data pose limitations on the generalisation performance of the estimated models.
- We have to select a set of nonlinear model structures with enough capacity to approximate the true function.
- We need techniques for fitting the selected models to the data.