Exploratory Data Analysis (EDA) in statistics is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Exploratory data analysis was promoted to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, and handling missing values and making transformations of variables as needed. EDA encompasses IDA. Exploratory data analysis, robust statistics, nonparametric statistics, and the development of statistical programming languages facilitated statisticians’ work on scientific and engineering problems. There are a number of tools that are useful for EDA, but EDA is characterized more by the attitude taken than by particular techniques. Typical graphical techniques used in EDA include the box plot, histogram, multi-vari chart, run chart, Pareto chart, scatter plot, stem-and-leaf plot, and parallel coordinates.
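As a small illustration of summarizing a data set's main characteristics, the sketch below computes the five numbers that a box plot draws and flags points outside the usual 1.5 × IQR fences. The sample values, the function names and the choice of the "inclusive" quartile method are assumptions made for this example only.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a box plot displays."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


def iqr_outliers(values):
    """Return the values lying beyond 1.5 * IQR from the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]


sample = [2, 4, 4, 5, 7, 9, 10, 12, 30]  # made-up measurements
summary = five_number_summary(sample)
outliers = iqr_outliers(sample)
print(summary, outliers)
```

Different quartile conventions give slightly different fences, so the flagged points are a prompt for closer inspection rather than a verdict.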

## What is Euclidean Distance?

Euclidean distance in mathematics is the “ordinary” (i.e. straight-line) distance between two points in Euclidean space. With this distance, Euclidean space becomes a metric space. The associated norm is called the Euclidean norm. Older literature refers to the metric as a Pythagorean metric. A generalized term for the Euclidean norm is the L2 norm or L2 distance. The Euclidean distance between points p and q is the length of the line segment connecting them. In Cartesian coordinates, if $p = (p_1, p_2, \ldots, p_n)$ and $q = (q_1, q_2, \ldots, q_n)$ are two points in Euclidean n-space, then the distance $d$ from p to q, or from q to p, is given by the Pythagorean formula $d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2}$.
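The straight-line distance described above can be computed directly from the coordinates. This is a minimal sketch; the function name and the example points are chosen here for illustration.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points given as coordinate sequences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


d = euclidean_distance((0, 0), (3, 4))
print(d)
```

The distance is symmetric, so swapping the two points gives the same value.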

## What is Estimation?

Estimation is the process of finding an estimate, or approximation, which is a value that is usable for some purpose even if input data may be incomplete, uncertain, or unstable. The value is nonetheless usable because it is derived from the best information available. Typically, estimation involves “using the value of a statistic derived from a sample to estimate the value of a corresponding population parameter”. The sample provides information that can be projected, through various formal or informal processes, to determine a range most likely to describe the missing information. An estimate that turns out to be incorrect will be an overestimate if the estimate exceeded the actual result or an underestimate if the estimate fell short of the actual result. Estimation is often done by sampling, which is counting a small number of examples and projecting that number onto a larger population. Estimates can similarly be generated by projecting results from polls or surveys onto the entire population. Estimation is important in business and economics because too many variables exist to figure out how large-scale activities will develop.
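To make the idea of projecting a sample onto a population concrete, the sketch below estimates a population mean from a random sample and labels the result as an overestimate or underestimate. The population, sample size and seed are invented for the example.

```python
import random


def estimate_mean(population, sample_size, seed=0):
    """Estimate the population mean from a simple random sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(population, sample_size)
    return sum(sample) / sample_size


population = list(range(1, 1001))  # hypothetical population of 1000 values
true_mean = sum(population) / len(population)
estimate = estimate_mean(population, sample_size=100)

error = estimate - true_mean
if error > 0:
    kind = "overestimate"
elif error < 0:
    kind = "underestimate"
else:
    kind = "exact"
print(estimate, true_mean, kind)
```

In practice the true mean is unknown, which is why the estimate is usually reported together with a range that reflects the sampling uncertainty.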

## What are Eigenvectors?

Eigenvectors are a special set of vectors associated with a linear system of equations that are sometimes also known as characteristic vectors, proper vectors, or latent vectors. The determination of the eigenvectors and eigenvalues of a system is extremely important in physics and engineering, where it is equivalent to matrix diagonalization and arises in such common applications as stability analysis, the physics of rotating bodies, and small oscillations of vibrating systems, to name only a few. Each eigenvector is paired with a corresponding eigenvalue. Mathematically, two different kinds of eigenvectors need to be distinguished: left eigenvectors and right eigenvectors.
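The pairing of an eigenvector with its eigenvalue can be illustrated with power iteration, one simple way to approximate the dominant (right) eigenpair of a square matrix. This is a sketch under assumptions: the matrix is small and has a single eigenvalue of largest magnitude, and the starting vector is not orthogonal to the corresponding eigenvector. The function name and example matrix are illustrative.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def power_iteration(matrix, iterations=100):
    """Approximate the dominant eigenvalue and a unit-length right eigenvector."""
    vector = [1.0] + [0.0] * (len(matrix) - 1)  # arbitrary starting direction
    for _ in range(iterations):
        product = mat_vec(matrix, vector)
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    # Rayleigh quotient v^T A v for the unit vector v gives the eigenvalue.
    eigenvalue = sum(v * a for v, a in zip(vector, mat_vec(matrix, vector)))
    return eigenvalue, vector


A = [[2.0, 1.0], [1.0, 2.0]]
eigenvalue, eigenvector = power_iteration(A)
print(eigenvalue, eigenvector)
```

For this symmetric example the left and right eigenvectors coincide; for a non-symmetric matrix the left eigenvectors would be obtained by running the same procedure on the transpose.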

## Machine learning – deep learning

Deep learning refers to a class of machine learning techniques in which many layers of information-processing stages in hierarchical architectures are exploited for pattern classification and for feature or representation learning [10]. It lies at the intersection of several research areas, including neural networks, graphical modeling, optimization, pattern recognition, and signal processing [5]. Yann LeCun adopted the deep supervised backpropagation convolutional network for digit recognition. More recently, deep learning has become a valuable research topic in both computer vision and machine learning, where it achieves state-of-the-art results on a variety of tasks. The deep convolutional neural network (CNN) proposed by Hinton and colleagues in "ImageNet Classification with Deep Convolutional Neural Networks" came first in the ImageNet image classification task. The model was trained on more than one million images and achieved a winning top-5 test error rate of 15.3% over 1,000 classes. Later work obtained better results by improving CNN models: the top-5 test error rate decreased to 13.24% when the model was trained to simultaneously classify, locate and detect objects. Besides image classification, the object detection task can also benefit from the CNN model. Generally speaking, three important reasons for the popularity of deep learning today are drastically increased chip processing abilities (e.g., GPUs), the significantly lower cost of computing hardware, and recent advances in machine learning and signal/information processing research. Over the past several years, a rich family of deep learning techniques has been proposed and extensively studied, e.g., Deep Belief Networks (DBN), Boltzmann Machines (BM), Restricted Boltzmann Machines (RBM), Deep Boltzmann Machines (DBM), and Deep Neural Networks (DNN).
Among these techniques, the deep convolutional neural network, which is a discriminative deep architecture and belongs to the DNN category, has achieved state-of-the-art performance on various tasks and competitions in computer vision and image recognition. Specifically, the CNN model consists of several convolutional layers and pooling layers stacked one on top of another. The convolutional layer shares many weights, and the pooling layer sub-samples the output of the convolutional layer and reduces the data rate from the layer below. The weight sharing in the convolutional layer, together with appropriately chosen pooling schemes, endows the CNN with some invariance properties (e.g., translation invariance). My work is similar to that of Ji Wan et al. [10] but differs in that the dataset I use is different from the ones used in their study. My approach to image matching is also novel and has not been used in any similar study.

Distance metric learning (DML) is an important concept in image retrieval that has been studied extensively in machine learning [4], [7]. In this section I discuss existing work on DML, which can be organized by learning setting and principle. Most current DML studies work with two types of side information in the training data: pairwise constraints, where must-link and cannot-link constraints are given, and triplet constraints, each of which consists of a similar pair and a dissimilar pair. Other studies use class labels directly for DML by following a typical machine learning scheme such as the large margin nearest neighbor (LMNN) algorithm. I have used class labels directly for DML. With respect to learning technique, distance metric learning can typically be categorized into two groups: the local supervised approach, where metric learning is done in a local sense by satisfying the local constraints given by neighboring information, and the global supervised approach, where all the constraints are satisfied simultaneously in a global setting. Most current DML studies use batch learning, in which the whole collection of training data must be given before the training task and a model is trained from scratch. The key idea behind distance metric learning is that, for an optimal metric, the distance between similar images should be minimized and the distance between dissimilar images should be maximized.

The image features are treated as words in order to apply the Bag of Words model to image classification [9]. A bag of visual words [13] in computer vision is defined as a vector of occurrence counts of a vocabulary of local image features [7]. In my project I used a dictionary of 40 words. To compute the key-points I first used SURF and then compared the results with SIFT to determine which worked better for the project. SURF is a robust local feature detector. It uses an integer approximation to the determinant of Hessian blob detector, which can be computed extremely quickly with an integral image (3 integer operations). I set the Hessian threshold to 600. SIFT is an algorithm to detect and describe local features in images. The local image gradients are measured at the selected scale in the region around each key-point. These are transformed into a representation that allows for significant levels of local shape distortion and change in illumination. The method I propose uses JSEG segmentation to segment the query image into regions. I then extracted color features to describe each region and SURF features from the entire image. The texture features were extracted using Gabor filters. After all features were extracted, I computed the bag of words using the SURF features of each region and combined it with the color and texture features to generate a feature vector. A random forest classifier was used to assign a class to each region. A similarity score was then computed against every image in the dataset based on the current region labels, and the images were ranked by this score. After ranking, the top n images (where n is the number of result images required by the user) were retrieved.
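The occurrence-count vector at the heart of the bag of visual words model can be sketched as follows. This is not the project's code: the two-dimensional descriptors and three-word vocabulary are toy stand-ins (real SURF descriptors have many more dimensions, and the project used 40 words), and the function names are hypothetical.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the visual word closest to the descriptor (squared Euclidean)."""
    def squared_distance(word):
        return sum((d - w) ** 2 for d, w in zip(descriptor, word))
    return min(range(len(vocabulary)), key=lambda k: squared_distance(vocabulary[k]))


def bag_of_words(descriptors, vocabulary):
    """Count how many descriptors fall on each visual word."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    return counts


# Toy vocabulary of three visual words and four key-point descriptors.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.2), (0.9, 1.1), (4.8, 5.3), (1.2, 0.8)]
histogram = bag_of_words(descriptors, vocabulary)
print(histogram)
```

In a full pipeline the vocabulary would typically be learned by clustering descriptors from the training images, and the resulting count vector would be concatenated with the color and texture features.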

The images in our dataset contain annotations of different regions in the form of XML files. The Extensible Markup Language (XML) annotations provide the annotated description of each image in the dataset, as shown in fig.3.1. From the XML annotations we generate the region masks of each image. The combination of region masks and XML annotations is used to generate descriptions of the image based on three main kinds of features: color features, texture features, and a description of the image using key-points and Bag of Words. These annotated image descriptions are stored in an index in the form of a dictionary for easy access.

We use a combination of dominant color, average color, dominant channel and fuzzy color histogram to describe the color feature. The dominant color descriptor uses representative colors to characterize the color information in the required region of an image, which makes it a compact and efficient descriptor. Local features of an image can be well represented by a dominant color descriptor, which helps in fast and efficient retrieval of images from large datasets. The average color descriptor returns the average of all colors present in the region, which is then used for comparison. The dominant channel descriptor takes into consideration the dominant tone per channel and returns the percentage of the dominant channels. Fuzzy 3D color histograms are required to compute the dominant color. Fuzzy versions are more balanced for colors that fall between color bins. We have used only 8 color bins in this project.

We used a Gabor filter as the texture feature descriptor. A Gabor filter is a linear filter used for edge detection; as an image filter, it can be used to describe the texture of an image. Gabor filters can have any size and orientation and are good at detecting edge orientations in images. Their only drawback is that they are scale-sensitive. We also added the average and standard deviation of brightness for each region to complement the information provided by the Gabor filter.

The query image is an image supplied by the user as a sample for retrieving images from the dataset. The query image can be from any source and need not be from our dataset. The system takes the input query image and segments it using JSEG segmentation, which is explained in the next section. The segmented image is used to generate the region masks, and from these region masks feature extraction produces a feature vector, as explained in fig.3.2. The feature vectors are passed into a region classifier (in our system, a random forest classifier), which gives classified regions as shown in fig.3.4.

We needed segmentation of images for the retrieval part of the project. We used JSEG segmentation because it is considered to be one of the best algorithms for segmenting color images. The reason is that it takes into consideration not only color but also texture while segmenting the image. JSEG segments images in an unsupervised manner based on color-texture regions, which involves performing color quantization and spatial segmentation independently. In the color quantization step, the regions in the image are differentiated by quantizing the colors in the image to several representative classes. A class map of the image is then formed by replacing the image pixels by their corresponding color class labels [3]. As shown in fig.3.1, the segmentation part is more of a black-box method. Fig.3.3 shows the segmentation of a sample query image. For this project we used an existing implementation of the algorithm and wrote a script to process the images with the application.

Once the image is segmented using JSEG, we obtain the region masks from the segmentation. These region masks are used for feature extraction, which gives a feature vector to which we apply a region classifier (a random forest in this case) to obtain classified regions like those in fig.3.4. We have color coded the 9 classes in the region classification for our convenience. In the image shown in fig.3.4, the dark blue region is classified as water, light blue is sky, brown is ground, dark gray is unknown and yellow is mountain. We store the region classification of all the images in our dataset and match them with the region classification of the query image, as described in the next section.

We take a query image and segment it using the JSEG algorithm, which returns the segmented region masks as .gif files. As OpenCV cannot load .gif files, we use the Pygame library to load each mask as a 2D array so that it can be processed. Once we have the 2D array, we separate out the region masks for each region in the image. We calculate the class percentage of each region in the query image and match it with the class percentages of all the other images in the database to generate a similarity score. The score uses "Histogram Intersection", which means taking two histograms and choosing the minimum value in each bin. These minimum values are then added, and the result is the similarity score. The images with the highest similarity scores are returned, up to the number of images requested.
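The histogram intersection score and the ranking step described above can be sketched as follows. The class percentages, image names and function names are invented for illustration; the project's own data structures are not shown in the text.

```python
def histogram_intersection(hist_a, hist_b):
    """Sum of the bin-wise minima of two histograms with the same bins."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


def rank_images(query_hist, dataset_hists, top_n):
    """Return the names of the top_n images with the highest similarity score."""
    scored = [
        (histogram_intersection(query_hist, hist), name)
        for name, hist in dataset_hists.items()
    ]
    # Highest score first; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_n]]


# Hypothetical class percentages per image, in the order (water, sky, ground).
query = [50, 30, 20]
dataset = {
    "lake": [60, 30, 10],
    "coast": [30, 50, 20],
    "desert": [0, 40, 60],
}
best = rank_images(query, dataset, top_n=2)
print(best)
```

When both histograms sum to the same total (here 100), the score ranges from 0 for no overlap to that total for identical class distributions.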

## Machine learning 3

There it was hinted, by means of a simple example, that in order to obtain a good representation of the process being modelled, one needs to estimate the model complexity, parameters and noise characteristics. In addition, it was mentioned that it is beneficial to incorporate a priori knowledge so as to mitigate the ill-conditioned nature of the learning problem. If we follow these specifications, we can almost assuredly obtain a model that generalises well.

This chapter will briefly review the classical approaches to learning and generalisation in the neural networks field. Aside from regularisation with noise and committees of estimators, most of the standard methods fall into two broadly overlapping categories: penalised likelihood and predictive assessment methods. Penalised likelihood methods involve placing a penalty term either on the model dimension or on the smoothness of the response (Hinton, 1987; Le Cun et al., 1990; Poggio and Girosi, 1990). Predictive assessment strategies, such as the cross-validation, jackknife or bootstrap methods (Ripley, 1996; Stone, 1974; Stone, 1978; Wahba and Wold, 1969), typically entail dividing the training data set into distinct subsets. The model is subsequently trained using all but one of the subsets and its performance is validated on the omitted subset. The procedure is repeated for each of the subsets. This predictive assessment is often used to set the penalty parameters in the penalised likelihood formulations.
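The subset-rotation procedure used in cross-validation can be sketched as an index generator: each subset is held out once for validation while the remaining subsets are used for training. This is a minimal sketch with hypothetical names, and it splits the data in order without shuffling.

```python
def k_fold_splits(n_samples, n_folds):
    """Yield (training_indices, validation_indices) for each held-out subset."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("need 2 <= n_folds <= n_samples")
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # spread the remainder evenly
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


splits = list(k_fold_splits(n_samples=10, n_folds=3))
for training, validation in splits:
    print(training, validation)
```

A penalty parameter would then be chosen by training once per split for each candidate value and keeping the value with the best average validation performance.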

These methods tend to lack a general and rigorous framework for incorporating a priori knowledge into the modelling process. Furthermore, they do not provide suitable foundations for the study of generalisation in sequential learning. To surmount these limitations, the Bayesian learning paradigm will be adopted in this thesis. This approach will allow us to incorporate a priori knowledge into the modelling process and to compute, jointly and within a probabilistic framework, the model parameters, noise characteristics, model structure and regularisation coefficients. It will also allow us to do this sequentially.

## Machine learning 2

Many physical processes may be described by a nonlinear, multivariate mapping from inputs to outputs that is corrupted by a noise process, with an index variable running over the data. Depending on how the data is gathered, we can identify two types of learning: batch learning and sequential learning. In the context of batch learning, the learning problem involves computing an approximation to the mapping and estimating the characteristics of the noise process given a fixed set of input-output observations. In contrast, in the sequential learning scenario, the observations arrive one at a time.

Typical instances of the learning problem include regression, where the output is continuous; classification, where the output corresponds to a discrete group of classes; and nonlinear dynamical system identification, where the inputs and targets correspond to several delayed versions of the signals under consideration.

The disturbances may represent both measurement noise and unknown inputs. This study will assume that they can be added directly to the output. The basis for this assumption is that noise in the input together with other system disturbances will propagate through the system and therefore can be lumped into one single measurement noise term (Ljung, 1987). When introducing sequential Monte Carlo methods in Chapter 6, this assumption will be weakened by adopting a more general formulation. In some scenarios, one might, however, be interested in modelling the distribution of the input data (Cornford et al., 1998; Wright, 1998). This topic, however, lies beyond the scope of this thesis.

The goal of learning, as posed here, is to obtain a description of the conditional distribution of the outputs given the inputs. As the dimension of this distribution can be very large, it is convenient to adopt a variational approach and project it into a lower dimensional space. This can be accomplished by introducing a set of parameters, leading to a distribution that is also conditioned on those parameters. For example, if we believe that the data has been generated by a Gaussian distribution, we only need two sufficient statistics to describe it, namely its mean and covariance. These statistics can, in turn, be described by a low-dimensional set of parameters. These parameters will allow us to infer the outputs whenever we observe new values of the inputs.

The regression function is a multivariate, nonlinear and possibly time-varying mapping. When the exact nonlinear structure of this mapping cannot be established a priori, it may be synthesised as a combination of parametrised multivariate basis functions. These multivariate basis functions may be generated from univariate basis functions using radial basis, tensor product or ridge construction methods. This type of modelling is often referred to as “non-parametric” regression because the number of basis functions is typically very large. This basis-function expansion, Equation (1.2), encompasses a large number of nonlinear estimation methods including projection pursuit regression (Friedman and Stuetzle, 1981; Huber, 1985), Volterra series (Billings, 1980; Mathews, 1991), fuzzy inference systems (Jang and Sun, 1993), generalised linear models (Nelder and Wedderburn, 1972), multivariate adaptive regression splines (MARS) (Denison, 1998; Friedman, 1991) and many artificial neural network paradigms such as functional link networks (Pao, 1989), multi-layer perceptrons (MLPs) (Rosenblatt, 1959; Rumelhart et al., 1986), radial basis function networks (RBFs) (Lowe, 1989; Moody and Darken, 1988; Poggio and Girosi, 1990), wavelet networks (Bakshi and Stephanopoulos, 1993; Juditsky et al., 1995) and hinging hyper-planes (Breiman, 1993). For an introduction to neural networks, the reader may consult any of the following books (Bishop, 1995b; Haykin, 1994; Hecht-Nielsen, 1990; Ripley, 1996).

Footnote on notation: a compact subscript notation is adopted to denote all the observations corresponding to a given output. If one index does not appear, it is implied that we are referring to all of its possible values. The shorter notation will be favoured. The longer notation will only be invoked to avoid ambiguities and emphasise certain dependencies.

Neural networks can approximate any continuous function arbitrarily well as the number of neurons (basis functions) increases without bound (Cybenko, 1989; Hornik et al., 1989; Poggio and Girosi, 1990). In addition, they have been successfully applied to many complex problems, including speech recognition (Robinson, 1994), handwritten digit recognition (Le Cun et al., 1989), financial modelling (Refenes, 1995) and medical diagnosis (Baxt, 1990), among others. This thesis will consider two types of neural network architectures: fixed dimension MLPs and variable dimension RBFs.

MLPs have enjoyed a privileged position in the neural networks community because of their simplicity, approximating power, relation to biological systems and various historical reasons. Figure 1.2 shows a typical two hidden layer MLP with logistic sigmoid basis functions in the hidden layers and a single linear output neuron. Networks of this type can be represented mathematically as nested weighted sums passed through sigmoid functions, where each neuron has a vector containing the weights connecting each input with that neuron. The logistic sigmoid function of an argument $z$ is given by:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Figure 1.2: Typical multi-layer perceptron architecture.

If our goal is to perform classification, then it is convenient to employ a logistic function in the output layer. This allows us to interpret the outputs of the network as probabilities of class membership. Although the MLPs discussed in this thesis exhibit a feed-forward architecture, they can be easily extended to recurrent schemes by the addition of multiple feedback connections or tapped delay lines (de Freitas et al., 1996; Narendra and Parthasarathy, 1990; Puskorius and Feldkamp, 1994; Qin et al., 1992; Sjöberg, 1995; Yamada and Yabuta, 1993).
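The feed-forward computation of a two hidden layer MLP with logistic sigmoid hidden units and a single linear output neuron can be sketched as below. The parameter layout (`W1`, `b1`, and so on), the presence of bias terms and the toy weights are assumptions for illustration, not the thesis's parametrisation.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def layer(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . inputs + b) for each neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def mlp_forward(x, params):
    """Two sigmoid hidden layers followed by a single linear output neuron."""
    h1 = layer(x, params["W1"], params["b1"], sigmoid)
    h2 = layer(h1, params["W2"], params["b2"], sigmoid)
    return layer(h2, params["W3"], params["b3"], lambda a: a)[0]


# Toy parameters: zero hidden weights make every hidden unit output 0.5.
params = {
    "W1": [[0.0, 0.0], [0.0, 0.0]], "b1": [0.0, 0.0],
    "W2": [[0.0, 0.0], [0.0, 0.0]], "b2": [0.0, 0.0],
    "W3": [[1.0, 1.0]], "b3": [0.0],
}
y = mlp_forward([3.0, -2.0], params)
print(y)
```

For classification, replacing the identity output activation with `sigmoid` gives an output in (0, 1) that can be read as a class-membership probability.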

RBF networks tend to be more tractable than MLPs. In these models, the training of the parameters corresponding to different layers is, to a large extent, decoupled. Chapters 5 and 7 will discuss an approximation scheme consisting of a mixture of RBFs and a linear regression term (Holmes and Mallick, 1998). The number of basis functions will be estimated from the data. Thus, unless the data is nonlinear, the model collapses to a standard linear model. In the linear-RBF model, $\|\cdot\|$ denotes a distance metric (usually Euclidean or Mahalanobis), and the remaining parameters are the RBF centres, the RBF amplitudes and the linear regression parameters. Figure 1.3 depicts the approximation model.

Figure 1.3: Linear-RBF approximation model with three radial basis functions, two inputs and two outputs. The solid lines indicate weighted connections.

Depending on our a priori knowledge about the smoothness of the mapping, we can choose different types of basis functions (Girosi et al., 1995). Some of the common choices carry a free parameter, which will be treated as a user-set parameter. Nevertheless, the Monte Carlo estimation strategies described in Chapters 5, 6 and 7 can treat the choice of basis functions as a model selection problem. It is possible to place a prior distribution on the basis functions and allow the Monte Carlo algorithms to decide which of them provide a better solution.
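A linear-RBF model of the kind described above, a linear regression term plus a sum of radial basis functions, can be sketched as follows. This is only an illustration: it assumes a Gaussian basis function with a user-set `width`, the Euclidean distance, and a single output, and all names are hypothetical.

```python
import math


def gaussian_basis(distance, width=1.0):
    """Gaussian radial basis function of the distance to a centre."""
    return math.exp(-width * distance ** 2)


def linear_rbf(x, centres, amplitudes, bias, linear_weights, width=1.0):
    """Linear regression term plus a weighted sum of RBFs (single output)."""
    linear_part = bias + sum(w * xi for w, xi in zip(linear_weights, x))
    rbf_part = sum(
        a * gaussian_basis(math.dist(x, c), width)  # Euclidean distance to centre
        for a, c in zip(amplitudes, centres)
    )
    return linear_part + rbf_part


x = [1.0, 2.0]
# With no basis functions the model reduces to ordinary linear regression.
y_linear = linear_rbf(x, centres=[], amplitudes=[], bias=0.5, linear_weights=[2.0, -1.0])
# A centre placed exactly at x contributes its full amplitude.
y_rbf = linear_rbf(x, centres=[(1.0, 2.0)], amplitudes=[3.0], bias=0.5, linear_weights=[2.0, -1.0])
print(y_linear, y_rbf)
```

Swapping `gaussian_basis` for another function of the distance changes the smoothness assumptions without altering the rest of the model.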


### 1.2 Scope and Contributions of this Thesis

In the example presented earlier, it was shown that the ability of a model to predict accurately with novel data depends on the amount of data, the complexity of the model and the noise in the data. It was then argued that artificial neural networks provide a general and flexible nonlinear modelling strategy. From this standpoint, the learning problem involves estimating the neural network’s parameters, the number of parameters, the type of basis functions and the statistics of the noise. In addition, we might have to select the most appropriate set of input signals.

A great deal of effort has been devoted to the solution of the parameter estimation problem. The other problems have received less attention. In contrast, the issues of noise estimation and model selection will be central to the scope of this thesis. It will be possible to manage these more demanding tasks by embracing the Bayesian learning paradigm. Despite the fact that the problems of input variable selection and basis function selection are not treated explicitly, the solution to these is a natural extension of the model selection frameworks presented in Chapters 6 and 7.

Another important theme in this thesis is the issue of sequential learning and inference. Sequential training methods for neural networks are important in many applications involving real-time signal processing, where data arrival is inherently sequential. Furthermore, one might wish to adopt a sequential processing strategy to deal with non-stationarity in signals, so that information from the recent past is given greater weight than information from the distant past. Computational simplicity in the form of not having to store all the data might also constitute an additional motivating factor for sequential methods.

This thesis proposes the following:

- A novel approach to perform regularisation in sequential learning. This approach establishes theoretical links between extended Kalman filters with adaptive noise estimation, gradient descent methods with multiple adaptive learning rates and training methods with multiple smoothing regularisation coefficients.
- An expectation maximisation (EM) algorithm to estimate the parameters of an MLP, the noise statistics and the model uncertainty jointly. The method is applicable to non-stationary parameter spaces.
- A robust Bayesian method to estimate, jointly, the parameters, number of parameters, noise statistics and signal to noise ratios of an RBF network. The necessary computations are performed using a reversible jump Markov chain Monte Carlo (MCMC) simulation method. In addition, it presents an efficient reversible jump MCMC simulated annealing strategy to perform global optimisation of RBF networks. Furthermore, it proves the convergence of these algorithms rigorously (see footnote 2).
- The use of particle filters and sequential Monte Carlo (SMC) methods in the neural networks field. In doing so, new SMC algorithms are devised to deal with the high dimensional parameter spaces inherent to neural network models. These algorithms are suitable for nonlinear, non-Gaussian and non-stationary modelling.
- A new and general sequential Monte Carlo approach to perform sequential noise estimation and model selection. The method is demonstrated on RBF networks.

### 1.3 Thesis Organisation

The chapters are, to a large extent, self-contained and can be read independently. Chapter 2 is of an introductory nature. At the end of each of the main chapters, Chapters 3 to 7, the proposed methods and algorithms are demonstrated on experiments with synthetic data. In Chapter 8, further tests on a few real problems are presented. A summary of the thesis follows:

Chapter 2: Learning and Generalisation

This chapter provides a brief review of learning theory from a neural networks perspective. It addresses both the classical and Bayesian approaches. In addition, it introduces the sequential learning problem.

Chapter 3: Sequential Bayesian Learning with Gaussian Approximations

Sequential learning methods, in particular Gaussian approximation schemes, are introduced in this chapter. It is shown that a hierarchical Bayesian modelling approach enables one to perform regularisation in sequential learning. Three inference levels are identified within this hierarchy, namely model selection, parameter estimation and noise estimation. In environments where data arrives sequentially, techniques such as cross-validation to achieve regularisation or model selection are not possible. The Bayesian approach, with extended Kalman filtering at the parameter estimation level, allows for regularisation within a minimum variance framework. A multi-layer perceptron is used to generate the extended Kalman filter nonlinear measurements mapping. Several algorithms are described at the noise estimation level, thus permitting the implementation of on-line regularisation. Another contribution of this chapter is to show the theoretical links between adaptive noise estimation in extended Kalman filtering, multiple adaptive learning rates and multiple smoothing regularisation coefficients.

Footnote 2, attached to the reversible jump MCMC contributions listed above: These contributions were strongly motivated by the work of Christophe Andrieu and Arnaud Doucet (Andrieu and Doucet, 1998b; Andrieu and Doucet, 1998a; Andrieu and Doucet, 1999).

Chapter 4: Dynamic Batch Learning with the EM Algorithm

This chapter extends the sequential Gaussian approximation framework discussed in the previous chapter to the batch learning scenario. In it, an EM algorithm for nonlinear state space models is derived. It is used to estimate jointly the neural network weights, the model uncertainty and the noise in the data. In the E-step, a forward-backward Rauch-Tung-Striebel smoother is adopted to compute the network weights. For the M-step, analytical expressions are derived to compute the model uncertainty and the measurement noise. The method is shown to be intrinsically very powerful, simple and stable.

Chapter 5: Robust Full Bayesian Learning with MCMC

This chapter begins the presentation of Monte Carlo methods, a major theme in this thesis. The reversible jump MCMC simulation algorithm is applied to RBF networks, so as to compute the joint posterior distribution of the radial basis centres and the number of basis functions. This area of research is advanced in three important directions. First, a robust prior for RBF networks is proposed. That is, the results do not depend on any heuristics or thresholds. Second, an automated growing and pruning reversible jump MCMC optimisation algorithm is designed to choose the model order according to classical AIC, BIC and MDL criteria. This MCMC algorithm estimates the maximum of the joint likelihood function of the radial basis centres and the number of bases using simulated annealing. Finally, some geometric convergence theorems for the proposed algorithms are presented.

Chapter 6: Sequential Monte Carlo Methods

Here, a novel strategy for training neural networks using sequential Monte Carlo (SMC)

algorithms is discussed. Various hybrid gradient descent/sampling importance resampling algorithms are proposed. In terms of both modelling flexibility and accuracy, SMC

algorithms provide a clear improvement over conventional Gaussian schemes. These

algorithms may be viewed as a global learning strategy to learn the probability distributions of the network weights and outputs in a sequential framework. They are also well

suited to applications involving on-line, nonlinear and non-Gaussian signal processing.

Chapter 7: Sequential Bayesian Model Selection

This chapter extends the model selection strategy discussed in Chapter 5 to the sequential learning case. This problem does not usually admit a closed-form

analytical solution and, as a result, one has to resort to numerical methods. The chapter proposes an original sequential simulation-based strategy to perform the necessary

computations. It combines sequential importance sampling, a selection procedure and

reversible jump MCMC moves. The effectiveness of the method is demonstrated by

applying it to radial basis function networks.

Chapter 8: Applications

This chapter demonstrates the performance of the various methods on some interesting real data sets. It includes comprehensive comparisons between the proposed algorithms.

Chapter 9: Conclusions

This final chapter summarises the theoretical and experimental results. It discusses

their relevance and suggests a few directions for further research.

## Machine Learning

Models are abstractions of reality to which experiments can be applied to improve

our understanding of phenomena in the world. They are at the heart of science and

permeate most disciplines of human endeavour, including economics, engineering, medicine, politics, sociology and data management in general. Models can

be used to process data to predict future events or to organise data in ways that allow

information to be extracted from it.

There are two common approaches to constructing models. The first is of a deductive nature. It relies on subdividing the system being modelled into subsystems that

can be expressed by accepted relationships and physical laws. These subsystems are

typically arranged in the form of simulation blocks and sets of differential equations.

The model is consequently obtained by combining all the sub-models.

The second approach favours the inductive strategy of estimating models from measured data. This estimation process will be referred to as “learning from data” or simply “learning” for short. In this context, learning implies finding patterns in the data

or obtaining a parsimonious representation of data that can then be used for several

purposes such as forecasting or classification. Learning is of paramount importance

in nonlinear modelling due to the lack of a coherent and comprehensive theory for

nonlinear systems.

Learning from data is an ill-posed problem. That is, there is a continuum of solutions for any particular data set. Consequently, certain restrictions have to be imposed

on the form of the solution. Often, a priori knowledge about the phenomenon being

modelled is available in the form of physical laws, generally accepted heuristics or

mathematical relations. This knowledge should be incorporated into the modelling

process so as to reduce the set of possible solutions to one that provides reasonable results. The ill-posed nature and other inherent difficulties associated with the problem

of learning can be clearly illustrated by means of a simple noisy interpolation example.

Consider the data plotted in Figure 1.1-A. It has been generated by the following equation:

$$y = f(x) + \epsilon$$

where $f(x)$ represents the true function between the input $x$ and the output $y$, and $\epsilon$

denotes zero mean uniformly distributed noise. The learning task is to use the noisy

data points plotted in Figure 1.1-A to estimate the true relation between the input and

the output.

We could attempt to model the data by polynomials of different order fitted to the

data by conventional least squares techniques. Let us assume that we try to fit second

and sixth order polynomials to the data. As shown in Figure 1.1-B, the 6th order

polynomial approximates the data better than the second order polynomial. However,

if we plot the true function and the two estimators as in Figure 1.1-C, we find that the

second order estimator provides a better approximation to the true function. Moreover,

the second order estimator provides a far better approximation to the true function for

novel (extrapolation) data, as depicted in Figure 1.1-D.
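The comparison described above can be reproduced in outline with a short script. This is only a sketch under stated assumptions, not the experiment behind Figure 1.1: the quadratic true function, the input ranges, the sample size, the noise level and the name `compare_polynomial_fits` are all chosen here for illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def compare_polynomial_fits(seed=0, n_train=15, noise_half_width=0.3):
    """Fit 2nd- and 6th-order polynomials to noisy samples and score them.

    Everything numeric here is an assumption made for illustration: the true
    function (a quadratic), the input ranges, the sample size and the noise level.
    """
    rng = np.random.default_rng(seed)

    def true_function(x):
        return x ** 2  # assumed stand-in for the true input-output relation

    # Noisy training data: true function plus zero-mean uniform noise.
    x_train = np.sort(rng.uniform(-1.0, 1.0, n_train))
    noise = rng.uniform(-noise_half_width, noise_half_width, n_train)
    y_train = true_function(x_train) + noise

    x_inside = np.linspace(-1.0, 1.0, 200)   # same range as the training inputs
    x_outside = np.linspace(1.0, 2.0, 200)   # extrapolation range, no training data

    scores = {}
    for order in (2, 6):
        coeffs = P.polyfit(x_train, y_train, order)  # ordinary least squares fit

        def mse(x, target):
            # Mean squared error of the fitted polynomial against a target.
            return float(np.mean((P.polyval(x, coeffs) - target) ** 2))

        scores[order] = {
            "train_mse": mse(x_train, y_train),
            "true_mse": mse(x_inside, true_function(x_inside)),
            "extrapolation_mse": mse(x_outside, true_function(x_outside)),
        }
    return scores


if __name__ == "__main__":
    for order, result in compare_polynomial_fits().items():
        print(order, result)
```

The only outcome guaranteed by construction is that the sixth order fit has a training error no larger than the second order fit, since the quadratic is a special case of the sixth order polynomial. How the two compare against the true function and on the extrapolation range depends on the assumed function, noise level and random seed.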

In conclusion, very complex estimators will approximate the training data points

better but may be worse estimators of the true function. Consequently, their predictions

for samples not encountered in the training data set may be worse than the predictions

produced by lower complexity estimators. The ability to predict well with samples

not encountered in the training data set is usually referred to as generalisation in the

machine learning literature. Note that if we had known the attributes of the noise

term a priori, we could have inferred that the 6th order polynomial was fitting it.

Alternatively, if we had had any data in the interval not covered by the training set, we would have noticed

the problems associated with using the 6th order polynomial. The last two remarks

indicate, clearly, that a priori knowledge and the size and scope of the data set play a

significant role in learning.

The previous simple example has unveiled several of the difficulties that arise when

we try to infer models from noisy data, namely:

- The learning problem is ill-posed. It contains infinitely many solutions.
- Noise and limited training data pose limitations on the generalisation performance of the estimated models.
- We have to select a set of nonlinear model structures with enough capacity to approximate the true function.
- We need techniques for fitting the selected models to the data.