Models are abstractions of reality to which experiments can be applied to improve

our understanding of phenomena in the world. They are at the heart of science and

permeate most disciplines of human endeavour, including economics, engineering, medicine, politics, sociology and data management in general. Models can

be used to process data to predict future events or to organise data in ways that allow

information to be extracted from it.

There are two common approaches to constructing models. The first is of a deductive nature. It relies on subdividing the system being modelled into subsystems that

can be expressed by accepted relationships and physical laws. These subsystems are

typically arranged in the form of simulation blocks and sets of differential equations.

The complete model is then obtained by combining all the sub-models.

The second approach favours the inductive strategy of estimating models from measured data. This estimation process will be referred to as “learning from data” or simply “learning” for short. In this context, learning implies finding patterns in the data

or obtaining a parsimonious representation of data that can then be used for several

purposes such as forecasting or classification. Learning is of paramount importance

in nonlinear modelling due to the lack of a coherent and comprehensive theory for

nonlinear systems.

Learning from data is an ill-posed problem. That is, there is a continuum of solutions for any particular data set. Consequently, certain restrictions have to be imposed

on the form of the solution. Often a priori knowledge about the phenomenon being

modelled is available in the form of physical laws, generally accepted heuristics or

mathematical relations. This knowledge should be incorporated into the modelling

process so as to reduce the set of possible solutions to one that provides reasonable results. The ill-posed nature and other inherent difficulties associated with the problem



of learning can be clearly illustrated by means of a simple noisy interpolation example.

Consider the data plotted in Figure 1.1-A. It has been generated by the following

equation:

y = f(x) + ε,

where f represents the true function between the input x and the output y, and ε
denotes zero-mean, uniformly distributed noise. The learning task is to use the noisy

data points plotted in Figure 1.1-A to estimate the true relation between the input and

the output.
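As a concrete stand-in for this setup, the following sketch generates noisy data of the form y = f(x) + noise. The particular choice of f, the sampling interval and the noise level are illustrative assumptions, since the text does not reproduce the actual generating equation behind Figure 1.1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the true function; the one used to
# generate Figure 1.1 is not given in the text.
def f(x):
    return 0.5 * x**2 - x

x = np.linspace(0.0, 1.0, 10)                  # input samples
noise = rng.uniform(-0.3, 0.3, size=x.size)    # zero-mean, uniformly distributed noise
y = f(x) + noise                               # observed noisy outputs
```

By construction, each observation deviates from the true function by at most the noise bound, which is what makes recovering f from y a non-trivial estimation problem.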

We could attempt to model the data by polynomials of different order fitted to the

data by conventional least squares techniques. Let us assume that we try to fit second

and sixth order polynomials to the data. As shown in Figure 1.1-B, the 6th order

polynomial approximates the data better than the second order polynomial. However,

if we plot the true function and the two estimators as in Figure 1.1-C, we find that the

second order estimator provides a better approximation to the true function. Moreover,

the second order estimator provides a far better approximation to the true function for

novel (extrapolation) data, as depicted in Figure 1.1-D.
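The comparison described above can be reproduced in outline. The sketch below fits second- and sixth-order polynomials by ordinary least squares to hypothetical noisy data (the true function, interval and noise level are assumptions, not the ones behind Figure 1.1) and measures each fit against both the noisy observations and the true function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true function; assumed for illustration only.
def f(x):
    return 0.5 * x**2 - x

x_train = np.linspace(0.0, 1.0, 10)
y_train = f(x_train) + rng.uniform(-0.3, 0.3, size=x_train.size)

train_err, true_err = {}, {}
for degree in (2, 6):
    coeffs = np.polyfit(x_train, y_train, degree)   # conventional least squares fit
    y_hat = np.polyval(coeffs, x_train)
    train_err[degree] = np.mean((y_hat - y_train) ** 2)  # fit to the noisy data
    true_err[degree] = np.mean((y_hat - f(x_train)) ** 2)  # fit to the true function
    print(degree, train_err[degree], true_err[degree])
```

Because the sixth-order basis contains the second-order one, the higher-order fit is guaranteed to have a training error at least as small; how well either fit tracks the true function is a separate question, which is precisely the point of the example.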

In conclusion, very complex estimators will approximate the training data points

better but may be worse estimators of the true function. Consequently, their predictions

for samples not encountered in the training data set may be worse than the predictions

produced by lower-complexity estimators. The ability to predict well on samples

not encountered in the training data set is usually referred to as generalisation in the

machine learning literature. Note that if we had known the attributes of the noise

term a priori, we could have inferred that the 6th order polynomial was fitting it.

Alternatively, if we had had any data in the extrapolation interval, we would have noticed

the problems associated with using the 6th order polynomial. The last two remarks

indicate clearly that a priori knowledge and the size and scope of the data set play a

significant role in learning.
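One common way of imposing such restrictions, sketched here with assumed data rather than any method prescribed by the text, is to penalise large polynomial coefficients (ridge regression). The penalty shrinks the solution of a sixth-order fit, restricting the set of admissible solutions to smoother functions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true function and data; assumptions for illustration.
def f(x):
    return 0.5 * x**2 - x

x = np.linspace(0.0, 1.0, 10)
y = f(x) + rng.uniform(-0.3, 0.3, size=x.size)

# Design matrix of a sixth-order polynomial.
X = np.vander(x, 7)

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: (X^T X + lam*I)^{-1} X^T y
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

w_ls = ridge_fit(X, y, 0.0)       # plain least squares: no restriction
w_ridge = ridge_fit(X, y, 1e-2)   # penalised: coefficients are shrunk
print(np.linalg.norm(w_ls), np.linalg.norm(w_ridge))
```

The penalty term plays the role of the a priori knowledge discussed above: it encodes the belief that the underlying function is smooth, and the ridge solution always has a coefficient norm no larger than the unrestricted least squares solution.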

The previous simple example has unveiled several of the difficulties that arise when

we try to infer models from noisy data, namely:

- The learning problem is ill-posed: it contains infinitely many solutions.

- Noise and limited training data pose limitations on the generalisation performance of the estimated models.

- We have to select a set of nonlinear model structures with enough capacity to approximate the true function.

- We need techniques for fitting the selected models to the data.