There it was hinted, by means of a simple example, that in order to obtain a good representation of the process being modelled, one needs to estimate the model complexity,

parameters and noise characteristics. In addition, it was mentioned that it is beneficial

to incorporatea priori knowledge so as to mitigate the ill-conditioned nature of the

learning problem. If we follow these specifications, we can almost assuredly obtain a

model that generalises well.

This chapter will briefly review the classical approaches to learning and generalisation in the neural networks field. Aside from regularisation with noise and committees of estimators, most of the standard methods fall into two broadly overlapping

categories: penalised likelihood and predictive assessment methods. Penalised likelihood methods involve placing a penalty term either on the model dimension or on the

smoothness of the response (Hinton, 1987; Le Cun et al., 1990; Poggio and Girosi,

1990). Predictive assessment strategies, such as the cross-validation, jacknife or bootstrap methods (Ripley, 1996; Stone, 1974; Stone, 1978; Wahba and Wold, 1969),

typically entail dividing the training data set into distinct subsets. The model is

subsequently trained using ~ M of the subsets and its performance is validated on

the omitted subset. The procedure is repeated for each of the subsets. This predictive assessment is often used to set the penalty parameters in the penalised likelihood

formulations.

These methods tend to lack a general and rigorous framework for incorporatinga

prioriknowledge into the modelling process. Furthermore, they do not provide suitable foundations for the study of generalisation in sequential learning. To surmount

these limitations, the Bayesian learning paradigm will be adopted in this thesis. This

approach will allow us to incorporatea priori knowledge into the modelling process

and to compute, jointly and within a probabilistic framework, the model parameters,

23

Learning and Generalisation 24

noise characteristics, model structure and regularisation coefficients. It will also allow

us to do this sequentially.

## Machine learning 2

Many physical processes may be described by the following nonlinear, multivariate variable over the data. Depending on how the data is gathered, we can identify two types of learning: batch learning and sequential learning. In the context of batch

learning, the learning problem involves computing an approximation to the function and estimating the characteristics of the noise process given a set of I input-output

observations:

In contrast, in the sequential learning scenario, the observations arrive one at a time.

Typical instances of the learning problem include regression, where is continuous; classification, where @ corresponds to a discrete group of classes; and nonlinear dynamical system identification, where the inputs and targets correspond to several

delayed versions of the signals under consideration.

The disturbances may represent both measurement noise and unknown inputs.

This study will assume that they can be added directly to the output. The basis for

this assumption is that noise in the input together with other system disturbances will

propagate through the system and therefore can be lumped into one single measurement noise term (Ljung, 1987). When introducing sequential Monte Carlo methods in

Chapter 6, this assumption will be weakened by adopting a more general formulation:

In some scenarios, one might, however, be interested in modelling the

distribution of the input data0 (Cornford et al., 1998; Wright, 1998). This topic,

however, lies beyond the scope of this thesis.

The goal of learning, as posed here, is to obtain a description of the conditional

distribution0 . As the dimension of this distribution can be very large, it is convenient to adopt a variational approach and project it into a lower dimensional space.

This can be accomplished by introducing a set of parameters leading to the

distribution0 . For example, if we believe that the data has been generated by

a Gaussian distribution, we only need two sufficient statistics to describe it, namely its

mean and covariance. These statistics can, in turn, be described by a low-dimensional

set of parameters. These parameters will allow us to infer the outputs@ whenever we

observe new values of the inputs .

The regression function of is a multivariate, nonlinear and

is adopted to denote all the observations corresponding to the th output

. To simplify the notation, is equivalent to That is, if one index does not

appear, it is implied that we are referring to all of its possible values. Similarly, is equivalent to

The shorter notation will be favoured. The longer notation will only be invoked to avoid ambiguities and

emphasise certain dependencies.

Introduction 16

possibly time-varying mapping. When the exact nonlinear structure of this mapping

cannot be establisheda priori, it may be synthesised as a combination of parametrised

basis functions. That is:

denotes a multivariate basis function. These multivariate basis

functions may be generated from univariate basis functions using radial basis, tensor product or ridge construction methods. This type of modelling is often referred to

as “non-parametric” regression because the number of basis functions is typically very

large. Equation (1.2) encompasses a large number of nonlinear estimation methods

including projection pursuit regression (Friedman and Stuetzle, 1981; Huber, 1985),

Volterra series (Billings, 1980; Mathews, 1991), fuzzy inference systems (Jang and Sun,

1993), generalised linear models (Nelder and Wedderburn, 1972), multivariate adaptive regression splines (MARS) (Denison, 1998; Friedman, 1991) and many artificial

neural network paradigms such as functional link networks (Pao, 1989), multi-layer

perceptrons (MLPs) (Rosenblatt, 1959; Rumelhart et al., 1986), radial basis function

networks (RBFs) (Lowe, 1989; Moody and Darken, 1988; Poggio and Girosi, 1990),

wavelet networks (Bakshi and Stephanopoulos, 1993; Juditsky et al., 1995) and hinging hyper-planes (Breiman, 1993). For an introduction to neural networks, the reader

may consult any of the following books (Bishop, 1995b; Haykin, 1994; Hecht-Nielsen,

1990; Ripley, 1996).

Neural networks can approximate any continuous function arbitrarily well as the

number of neurons (basis functions) increases without bound (Cybenko, 1989; Hornik

et al., 1989; Poggio and Girosi, 1990). In addition, they have been successfully applied

to many complex problems, including speech recognition (Robinson, 1994), hand written digit recognition (Le Cun et al., 1989), financial modelling (Refenes, 1995) and

medical diagnosis (Baxt, 1990) among others. This thesis will consider two types of

neural network architectures: fixed dimension MLPs and variable dimension RBFs.

MLPs have enjoyed a privileged position in the neural networks community because

of their simplicity, approximating power, relation to biological systems and various historical reasons. Figure 1.2 shows a typical two hidden layer MLP with logistic sigmoid

basis functions in the hidden layers and a single output linear neuron. Networks of this

type can be represented mathematically as follows:

containing the weights connecting each input with the* th neuron. The logistic sigmoid

Introduction 17

function is given by:

If our goal is to perform classification, then it is convenient to employ a logistic funcb

Figure 1.2 Typical multi-layer perceptron architecture.

tion in the output layer. This allows us to interpret the outputs of the network as

probabilities of class membership. Although the MLPs discussed in this thesis exhibit

a feed-forward architecture, they can be easily extended to recurrent schemes by the

addition of multiple feedback connections or tapped delay lines (de Freitas et al., 1996;

Narendra and Parthasarathy, 1990; Puskorius and Feldkamp, 1994; Qin et al., 1992;

Sjoberg, ¤ 1995; Yamada and Yabuta, 1993).

RBF networks tend to be more tractable than MLPs. In these models, the training

of the parameters corresponding to different layers is, to a large extent, decoupled.

Chapters 5 and 7 will discuss an approximation scheme consisting of a mixture of

RBFs and a linear regression term (Holmes and Mallick, 1998). The number of basis

functions will be estimated from the data. Thus, unless the data is nonlinear, the model

collapses to a standard linear model. More precisely, the linear-RBF model is given

by:

Introduction 18

where, in this case,, p ^ p denotes a distance metric (usually Euclidean or

Mahalanobis),N = denotes the -th RBF centre for a model with RBFs, 5N =

denotes the -th RBF amplitude and N = and N = 1 = denotes the linear

regression parameters. Figure 1.3 depicts the approximation model for

and T . Depending on our a priori knowledge about the smoothness of the mapping,

Figure 1.3 Linear-RBF approximation model with three radial basis functions, two inputs and two

outputs. The solid lines indicate weighted connections.

we can choose different types of basis functions (Girosi et al., 1995). The most common

choices are:

For the last two choices of basis functions, ( will be treated as a user-set parameter.

Nevertheless, the Monte Carlo estimation strategies described in Chapters 5, 6 and 7

can treat the choice of basis functions as a model selection problem. It is possible to

place a prior distribution on the basis functions and allow the Monte Carlo algorithms

to decide which of them provide a better solution.

Introduction 19

1.2 Scope and Contributions of this Thesis

In the example presented earlier, it was shown that the ability of a model to predict accurately with novel data depends on the amount of data, the complexity of the model

and the noise in the data. It was then argued that artificial neural networks provide a

general and flexible nonlinear modelling strategy. From this standpoint, the learning

problem involves estimating the neural network’s parameters, the number of parameters, the type of basis functions and the statistics of the noise. In addition, we might

have to select the most appropriate set of input signals.

A great deal of effort has been devoted to the solution of the parameter estimation

problem. The other problems have received less attention. In contrast, the issues of

noise estimation and model selection will be central to the scope of this thesis. It will be

possible to manage these more demanding tasks by embracing the Bayesian learning

paradigm. Despite the fact that the problems of input variable selection and basis

function selection are not treated explicitly, the solution to these is a natural extension

of the model selection frameworks presented in Chapters 6 and 7.

Another important theme in this thesis is the issue of sequential learning and inference. Sequential training methods for neural networks are important in many applications involving real-time signal processing, where data arrival is inherently sequential.

Furthermore, one might wish to adopt a sequential processing strategy to deal with

non-stationarity in signals, so that information from the recent past is given greater

weight than information from the distant past. Computational simplicity in the form

of not having to store all the data might also constitute an additional motivating factor

for sequential methods.

This thesis proposes the following:

A novel approach to perform regularisation in sequential learning. This approach

establishes theoretical links between extended Kalman filters with adaptive noise

estimation, gradient descent methods with multiple adaptive learning rates and

training methods with multiple smoothing regularisation coefficients.

An expectation maximisation (EM) algorithm to estimate the parameters of an

MLP, the noise statistics and the model uncertainty jointly. The method is applicable to non-stationary parameter spaces.

A robust Bayesian method to estimate, jointly, the parameters, number of parameters, noise statistics and signal to noise ratios of an RBF network. The necessary

computations are performed using a reversible jump Markov chain Monte Carlo

(MCMC) simulation method. In addition, it presents an efficient reversible jump

MCMC simulated annealing strategy to perform global optimisation of RBF net-

Introduction 20

works. Furthermore, it proves the convergence of these algorithms rigorously2.

The use of particle filters and sequential Monte Carlo (SMC) methods to the neural networks field. In doing so, new SMC algorithms are devised to deal with the

high dimensional parameter spaces inherent to neural network models. These algorithms are suitable for nonlinear, non-Gaussian and non-stationary modelling.

A new and general sequential Monte Carlo approach to perform sequential noise

estimation and model selection. The method is demonstrated on RBF networks.

1.3 Thesis Organisation

The chapters are, to a large extent, self contained and can be read independently. Chapter 2 is of an introductory nature. At the end of each of the main chapters, Chapters

3 to 7, the proposed methods and algorithms are demonstrated on experiments with

synthetic data. In Chapter 8, further tests on a few real problems are presented. A

summary of the thesis follows:

Chapter 2: Learning and Generalisation

This chapter provides a brief review of learning theory from a neural networks perspective. It addresses both the classical and Bayesian approaches. In addition, it introduces

the sequential learning problem.

Chapter 3: Sequential Bayesian Learning with Gaussian Approximations

Sequential learning methods, in particular Gaussian approximation schemes, are introduced in this chapter. It is shown that an hierarchical Bayesian modelling approach

enables one to perform regularisation in sequential learning. Three inference levels

are identified within this hierarchy, namely model selection, parameter estimation and

noise estimation. In environments where data arrives sequentially, techniques such

as cross-validation to achieve regularisation or model selection are not possible. The

Bayesian approach, with extended Kalman filtering at the parameter estimation level,

allows for regularisation within a minimum variance framework. A multi-layer perceptron is used to generate the extended Kalman filter nonlinear measurements mapping.

Several algorithms are described at the noise estimation level, thus permitting the implementation of on-line regularisation. Another contribution of this chapter is to show

2These contributions were strongly motivated by the work of Christophe Andrieu and Arnaud Doucet

(Andrieu and Doucet, 1998b; Andrieu and Doucet, 1998a; Andrieu and Doucet, 1999).

Introduction 21

the theoretical links between adaptive noise estimation in extended Kalman filtering,

multiple adaptive learning rates and multiple smoothing regularisation coefficients.

Chapter 4: Dynamic Batch Learning with the EM Algorithm

This chapter extends the sequential Gaussian approximation framework discussed in

the previous chapter to the batch learning scenario. In it, an EM algorithm for nonlinear

state space models is derived. It is used to estimate jointly the neural network weights,

the model uncertainty and the noise in the data. In the E-step, a forward-backward

Rauch-Tung-Striebel smoother is adopted to compute the network weights. For the

M-step, analytical expressions are derived to compute the model uncertainty and the

measurement noise. The method is shown to be intrinsically very powerful, simple and

stable.

Chapter 5: Robust Full Bayesian Learning with MCMC

This chapter begins the presentation of Monte Carlo methods, a major theme in this

thesis. The reversible jump MCMC simulation algorithm is applied to RBF networks, so

as to compute the joint posterior distribution of the radial basis centres and the number

of basis functions. This area of research is advanced in three important directions.

First, a robust prior for RBF networks is proposed. That is, the results do not depend

on any heuristics or thresholds. Second, an automated growing and pruning reversible

jump MCMC optimisation algorithm is designed to choose the model order according

to classical AIC, BIC and MDL criteria. This MCMC algorithm estimates the maximum

of the joint likelihood function of the radial basis centres and the number of bases using

simulated annealing. Finally, some geometric convergence theorems for the proposed

algorithms are presented.

Chapter 6: Sequential Monte Carlo Methods

Here, a novel strategy for training neural networks using sequential Monte Carlo (SMC)

algorithms is discussed. Various hybrid gradient descent/sampling importance resampling algorithms are proposed. In terms of both modelling flexibility and accuracy, SMC

algorithms provide a clear improvement over conventional Gaussian schemes. These

algorithms may be viewed as a global learning strategy to learn the probability distributions of the network weights and outputs in a sequential framework. They are also well

suited to applications involving on-line, nonlinear and non-Gaussian signal processing.

Introduction 22

Chapter 7: Sequential Bayesian Model Selection

This chapter extends the model selection strategy discussed in Chapter 5 to the sequential learning case. This problem does not usually admit any type of closed-form

analytical solutions and, as a result, one has to resort to numerical methods. The chapter proposes an original sequential simulation-based strategy to perform the necessary

computations. It combines sequential importance sampling, a selection procedure and

reversible jump MCMC moves. The effectiveness of the method is demonstrated by

applying it to radial basis function networks.

Chapter 8: Applications

This chapter demonstrates the performance of the various methods on some interesting real data sets. It includes comprehensive comparisons between the proposed algorithms.

Chapter 9: Conclusions

This final chapter summarises the theoretical and experimental results. It discusses

their relevance and suggests a few directions for further research

## Machine Learning1

Models are abstractions of reality to which experiments can be applied to improve

our understanding of phenomena in the world. They are at the heart of science and

permeate throughout most disciplines of human endeavour, including economics, engineering, medicine, politics, sociology and data management in general. Models can

be used to process data to predict future events or to organise data in ways that allow

information to be extracted from it.

There are two common approaches to constructing models. The first is of a deductive nature. It relies on subdividing the system being modelled into subsystems that

can be expressed by accepted relationships and physical laws. These subsystems are

typically arranged in the form of simulation blocks and sets of differential equations.

The model is consequently obtained by combining all the sub-models.

The second approach favours the inductive strategy of estimating models from measured data. This estimation process will be referred to as “learning from data” or simply “learning” for short. In this context, learning implies finding patterns in the data

or obtaining a parsimonious representation of data that can then be used for several

purposes such as forecasting or classification. Learning is of paramount importance

in nonlinear modelling due to the lack of a coherent and comprehensive theory for

nonlinear systems.

Learning from data is an ill-posed problem. That is, there is a continuum of solutions for any particular data set. Consequently, certain restrictions have to be imposed

on the form of the solution. Oftena priori knowledge about the phenomenon being

modelled is available in the form of physical laws, generally accepted heuristics or

mathematical relations. This knowledge should be incorporated into the modelling

process so as to reduce the set of possible solutions to one that provides reasonable results. The ill-posed nature and other inherent difficulties associated with the problem

12

Introduction 13

of learning can be clearly illustrated by means of a simple noisy interpolation example.

Consider the data plotted in Figure 1.1-A. It has been generated by the following

equation:

where represents the true function between the input and the output and

denotes zero mean uniformly distributed noise. The learning task is to use the noisy

data points plotted in Figure 1.1-A to estimate the true relation between the input and

the output.

We could attempt to model the data by polynomials of different order fitted to the

data by conventional least squares techniques. Let us assume that we try to fit second

and sixth order polynomials to the data. As shown in Figure 1.1-B, the 6th order

polynomial approximates the data better than the second order polynomial. However,

if we plot the true function and the two estimators as in Figure 1.1-C, we find that the

second order estimator provides a better approximation to the true function. Moreover,

the second order estimator provides a far better approximation to the true function for

novel (extrapolation) data, as depicted in Figure 1.1-D.

In conclusion, very complex estimators will approximate the training data points

better but may be worse estimators of the true function. Consequently, their predictions

for samples not encountered in the training data set may be worse than the predictions

produced by lower complexity estimators. The ability to predict well with samples

not encountered in the training data set is usually referred to as generalisation in the

machine learning literature. Note that if we had known the attributes of the noise

terma priori, we could have inferred that the 6th order polynomial was fitting it.

Alternatively, if we had had any data in the intervalM M # , we would have noticed

the problems associated with using the 6th order polynomial. The last two remarks

indicate, clearly, thata priori knowledge and the size and scope of the data set play a

significant role in learning.

The previous simple example has unveiled several of the difficulties that arise when

we try to infer models from noisy data, namely:

The learning problem is ill-posed. It contains infinitely many solutions.

Noise and limited training data pose limitations on the generalisation performance of the estimated models.

We have to select a set of nonlinear model structures with enough capacity to

approximate the true function.

We need techniques for fitting the selected models to the data