There it was hinted, by means of a simple example, that in order to obtain a good representation of the process being modelled, one needs to estimate the model complexity,

parameters and noise characteristics. In addition, it was mentioned that it is beneficial

to incorporatea priori knowledge so as to mitigate the ill-conditioned nature of the

learning problem. If we follow these specifications, we can almost assuredly obtain a

model that generalises well.

This chapter will briefly review the classical approaches to learning and generalisation in the neural networks field. Aside from regularisation with noise and committees of estimators, most of the standard methods fall into two broadly overlapping

categories: penalised likelihood and predictive assessment methods. Penalised likelihood methods involve placing a penalty term either on the model dimension or on the

smoothness of the response (Hinton, 1987; Le Cun et al., 1990; Poggio and Girosi,

1990). Predictive assessment strategies, such as the cross-validation, jacknife or bootstrap methods (Ripley, 1996; Stone, 1974; Stone, 1978; Wahba and Wold, 1969),

typically entail dividing the training data set into distinct subsets. The model is

subsequently trained using ~ M of the subsets and its performance is validated on

the omitted subset. The procedure is repeated for each of the subsets. This predictive assessment is often used to set the penalty parameters in the penalised likelihood

formulations.

These methods tend to lack a general and rigorous framework for incorporatinga

prioriknowledge into the modelling process. Furthermore, they do not provide suitable foundations for the study of generalisation in sequential learning. To surmount

these limitations, the Bayesian learning paradigm will be adopted in this thesis. This

approach will allow us to incorporatea priori knowledge into the modelling process

and to compute, jointly and within a probabilistic framework, the model parameters,

23

Learning and Generalisation 24

noise characteristics, model structure and regularisation coefficients. It will also allow

us to do this sequentially.

# Tag: machine learning algorithms

## Machine Learning1

Models are abstractions of reality to which experiments can be applied to improve

our understanding of phenomena in the world. They are at the heart of science and

permeate throughout most disciplines of human endeavour, including economics, engineering, medicine, politics, sociology and data management in general. Models can

be used to process data to predict future events or to organise data in ways that allow

information to be extracted from it.

There are two common approaches to constructing models. The first is of a deductive nature. It relies on subdividing the system being modelled into subsystems that

can be expressed by accepted relationships and physical laws. These subsystems are

typically arranged in the form of simulation blocks and sets of differential equations.

The model is consequently obtained by combining all the sub-models.

The second approach favours the inductive strategy of estimating models from measured data. This estimation process will be referred to as “learning from data” or simply “learning” for short. In this context, learning implies finding patterns in the data

or obtaining a parsimonious representation of data that can then be used for several

purposes such as forecasting or classification. Learning is of paramount importance

in nonlinear modelling due to the lack of a coherent and comprehensive theory for

nonlinear systems.

Learning from data is an ill-posed problem. That is, there is a continuum of solutions for any particular data set. Consequently, certain restrictions have to be imposed

on the form of the solution. Oftena priori knowledge about the phenomenon being

modelled is available in the form of physical laws, generally accepted heuristics or

mathematical relations. This knowledge should be incorporated into the modelling

process so as to reduce the set of possible solutions to one that provides reasonable results. The ill-posed nature and other inherent difficulties associated with the problem

12

Introduction 13

of learning can be clearly illustrated by means of a simple noisy interpolation example.

Consider the data plotted in Figure 1.1-A. It has been generated by the following

equation:

where represents the true function between the input and the output and

denotes zero mean uniformly distributed noise. The learning task is to use the noisy

data points plotted in Figure 1.1-A to estimate the true relation between the input and

the output.

We could attempt to model the data by polynomials of different order fitted to the

data by conventional least squares techniques. Let us assume that we try to fit second

and sixth order polynomials to the data. As shown in Figure 1.1-B, the 6th order

polynomial approximates the data better than the second order polynomial. However,

if we plot the true function and the two estimators as in Figure 1.1-C, we find that the

second order estimator provides a better approximation to the true function. Moreover,

the second order estimator provides a far better approximation to the true function for

novel (extrapolation) data, as depicted in Figure 1.1-D.

In conclusion, very complex estimators will approximate the training data points

better but may be worse estimators of the true function. Consequently, their predictions

for samples not encountered in the training data set may be worse than the predictions

produced by lower complexity estimators. The ability to predict well with samples

not encountered in the training data set is usually referred to as generalisation in the

machine learning literature. Note that if we had known the attributes of the noise

terma priori, we could have inferred that the 6th order polynomial was fitting it.

Alternatively, if we had had any data in the intervalM M # , we would have noticed

the problems associated with using the 6th order polynomial. The last two remarks

indicate, clearly, thata priori knowledge and the size and scope of the data set play a

significant role in learning.

The previous simple example has unveiled several of the difficulties that arise when

we try to infer models from noisy data, namely:

The learning problem is ill-posed. It contains infinitely many solutions.

Noise and limited training data pose limitations on the generalisation performance of the estimated models.

We have to select a set of nonlinear model structures with enough capacity to

approximate the true function.

We need techniques for fitting the selected models to the data