Machine Learning

Models are abstractions of reality to which experiments can be applied to improve
our understanding of phenomena in the world. They are at the heart of science and
permeate most disciplines of human endeavour, including economics, engineering, medicine, politics, sociology and data management in general. Models can
be used to process data to predict future events or to organise data in ways that allow
information to be extracted from it.
There are two common approaches to constructing models. The first is of a deductive nature. It relies on subdividing the system being modelled into subsystems that
can be expressed by accepted relationships and physical laws. These subsystems are
typically arranged in the form of simulation blocks and sets of differential equations.
The model is consequently obtained by combining all the sub-models.
The second approach favours the inductive strategy of estimating models from measured data. This estimation process will be referred to as “learning from data” or simply “learning” for short. In this context, learning implies finding patterns in the data
or obtaining a parsimonious representation of data that can then be used for several
purposes such as forecasting or classification. Learning is of paramount importance
in nonlinear modelling due to the lack of a coherent and comprehensive theory for
nonlinear systems.
Learning from data is an ill-posed problem. That is, there is a continuum of solutions for any particular data set. Consequently, certain restrictions have to be imposed
on the form of the solution. Often
a priori knowledge about the phenomenon being
modelled is available in the form of physical laws, generally accepted heuristics or
mathematical relations. This knowledge should be incorporated into the modelling
process so as to reduce the set of possible solutions to one that provides reasonable results. The ill-posed nature and other inherent difficulties associated with the problem
of learning can be clearly illustrated by means of a simple noisy interpolation example.
Consider the data plotted in Figure 1.1-A. It has been generated by the following
equation:

y = g(x) + e

where g represents the true function between the input x and the output y, and e
denotes zero-mean, uniformly distributed noise. The learning task is to use the noisy
data points plotted in Figure 1.1-A to estimate the true relation between the input and
the output.
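
Since the generating equation itself is not reproduced here, the short Python sketch below only illustrates the setup: a placeholder quadratic standing in for the unknown true function g, a small sample of inputs, and outputs corrupted by zero-mean, uniformly distributed noise. The function, sample size and noise amplitude are assumptions chosen for illustration, not the values behind Figure 1.1.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    # Placeholder for the true relation between input and output;
    # the original generating equation is not reproduced in the text.
    return 1.0 + 0.5 * x - x ** 2

# A small sample of inputs with outputs corrupted by zero-mean,
# uniformly distributed noise e, i.e. y = g(x) + e.
x_train = np.sort(rng.uniform(-1.0, 1.0, size=25))
e = rng.uniform(-0.2, 0.2, size=x_train.shape)
y_train = g(x_train) + e
```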
We could attempt to model the data with polynomials of different orders, fitted by
conventional least squares techniques. Let us assume that we try to fit second
and sixth order polynomials to the data. As shown in Figure 1.1-B, the 6th order
polynomial approximates the data better than the second order polynomial. However,
if we plot the true function and the two estimators as in Figure 1.1-C, we find that the
second order estimator provides a better approximation to the true function. Moreover,
the second order estimator provides a far better approximation to the true function for
novel (extrapolation) data, as depicted in Figure 1.1-D.
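
Under the same illustrative assumptions, a minimal least-squares sketch reproduces the qualitative comparison: second and sixth order polynomials are fitted with numpy.polyfit, and each fit is scored against the noisy training points (cf. Figure 1.1-B), against the assumed true function inside the training range (cf. Figure 1.1-C), and on extrapolation inputs outside it (cf. Figure 1.1-D). The evaluation ranges are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
g = lambda x: 1.0 + 0.5 * x - x ** 2                  # assumed true function
x_train = np.sort(rng.uniform(-1.0, 1.0, size=25))
y_train = g(x_train) + rng.uniform(-0.2, 0.2, size=25)

# Conventional least-squares polynomial fits of order 2 and 6.
p2 = np.polyfit(x_train, y_train, deg=2)
p6 = np.polyfit(x_train, y_train, deg=6)

x_dense = np.linspace(-1.0, 1.0, 200)   # inside the training range
x_extra = np.linspace(1.0, 1.5, 50)     # extrapolation inputs outside it

for name, p in (("order 2", p2), ("order 6", p6)):
    # Fit to the noisy training points (cf. Figure 1.1-B).
    mse_train = np.mean((np.polyval(p, x_train) - y_train) ** 2)
    # Distance from the assumed true function (cf. Figure 1.1-C).
    mse_true = np.mean((np.polyval(p, x_dense) - g(x_dense)) ** 2)
    # Error on extrapolation inputs (cf. Figure 1.1-D).
    mse_extra = np.mean((np.polyval(p, x_extra) - g(x_extra)) ** 2)
    print(f"{name}: train={mse_train:.4f}  true={mse_true:.4f}  extrap={mse_extra:.4f}")
```

Because the order-6 model nests the order-2 model, its residual on the training points can never be larger; whether it also stays close to the true function and extrapolates sensibly is exactly what the other two scores probe.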
In conclusion, very complex estimators will approximate the training data points
better but may be worse estimators of the true function. Consequently, their predictions
for samples not encountered in the training data set may be worse than the predictions
produced by lower complexity estimators. The ability to predict well with samples
not encountered in the training data set is usually referred to as generalisation in the
machine learning literature. Note that if we had known the attributes of the noise
term
a priori, we could have inferred that the 6th order polynomial was fitting it.
Alternatively, if we had had any data in the extrapolation interval shown in
Figure 1.1-D, we would have noticed
the problems associated with using the 6th order polynomial. The last two remarks
indicate clearly that
a priori knowledge and the size and scope of the data set play a
significant role in learning.
The previous simple example has unveiled several of the difficulties that arise when
we try to infer models from noisy data, namely:
- The learning problem is ill-posed. It contains infinitely many solutions.
- Noise and limited training data pose limitations on the generalisation performance of the estimated models.
- We have to select a set of nonlinear model structures with enough capacity to approximate the true function.
- We need techniques for fitting the selected models to the data.