What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) in statistics is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Exploratory data analysis was promoted to encourage statisticians to explore the data and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on checking the assumptions required for model fitting and hypothesis testing, handling missing values and making transformations of variables as needed; EDA encompasses IDA. Exploratory data analysis, robust statistics, nonparametric statistics, and the development of statistical programming languages facilitated statisticians’ work on scientific and engineering problems. There are a number of tools that are useful for EDA, but EDA is characterized more by the attitude taken than by particular techniques. Typical graphical techniques used in EDA are: box plot, histogram, multi-vari chart, run chart, Pareto chart, scatter plot, stem-and-leaf plot, parallel coordinates, etc.

What is Euclidean Distance?

Euclidean distance in mathematics is the “ordinary” (i.e. straight-line) distance between two points in Euclidean space. With this distance, Euclidean space becomes a metric space. The associated norm is called the Euclidean norm. Older literature refers to the metric as a Pythagorean metric. A generalized term for the Euclidean norm is the L2 norm or L2 distance. The Euclidean distance between points p and q is the length of the line segment connecting them. In Cartesian coordinates, if p = (p1, p2,…, pn) and q = (q1, q2,…, qn) are two points in Euclidean n-space, then the distance (d) from p to q, or from q to p is given by the Pythagorean formula.
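For completeness, the Pythagorean formula referred to above is d(p, q) = sqrt((q1 - p1)^2 + ... + (qn - pn)^2). A minimal Python illustration:

import math

def euclidean_distance(p, q):
    """Straight-line (L2) distance between two points in Euclidean n-space."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0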

What is Estimation?

Estimation is the process of finding an estimate, or approximation, which is a value that is usable for some purpose even if input data may be incomplete, uncertain, or unstable. The value is nonetheless usable because it is derived from the best information available. Typically, estimation involves “using the value of a statistic derived from a sample to estimate the value of a corresponding population parameter”. The sample provides information that can be projected, through various formal or informal processes, to determine a range most likely to describe the missing information. An estimate that turns out to be incorrect will be an overestimate if the estimate exceeded the actual result or an underestimate if the estimate fell short of the actual result. Estimation is often done by sampling, which is counting a small number of examples and projecting that number onto a larger population. Estimates can similarly be generated by projecting results from polls or surveys onto the entire population. Estimation is important in business and economics because too many variables exist to figure out how large-scale activities will develop.

What are Eigenvectors?

Eigenvectors are a special set of vectors associated with a linear system of equations that are sometimes also known as characteristic vectors, proper vectors, or latent vectors. The determination of the eigenvectors and eigenvalues of a system is extremely important in physics and engineering, where it is equivalent to matrix diagonalization and arises in such common applications as stability analysis, the physics of rotating bodies, and small oscillations of vibrating systems, to name only a few. Each eigenvector is paired with a corresponding so-called eigenvalue. Mathematically, two different kinds of eigenvectors need to be distinguished: left eigenvectors and right eigenvectors.
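As a small illustration, a (right) eigenvector x of a matrix A satisfies A x = λ x, where λ is the corresponding eigenvalue; the NumPy snippet below verifies this for an arbitrary example matrix.

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)   # right eigenvectors, one per column
x, lam = eigenvectors[:, 0], eigenvalues[0]

print(np.allclose(A @ x, lam * x))  # True: A x = lambda x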

Machine learning – deep learning

Deep learning refers to a class of machine learning techniques in which many layers of information processing stages in hierarchical architectures are exploited for pattern classification and for feature or representation learning [10]. It lies at the intersection of several research areas, including neural networks, graphical modeling, optimization, pattern recognition and signal processing [5]. Yann LeCun adopted the deep supervised backpropagation convolutional network for digit recognition. More recently, deep learning has become a valuable research topic in both computer vision and machine learning, where it achieves state-of-the-art results for a variety of tasks. The deep convolutional neural network (CNN) proposed by Hinton and colleagues came out first in the ImageNet image classification task (described in “ImageNet Classification with Deep Convolutional Neural Networks”). The model was trained on more than one million images and achieved a winning top-5 test error rate of 15.3% over 1,000 classes. After that, later works obtained better results by improving the CNN model; the top-5 test error rate decreased to 13.24% by training the model to simultaneously classify, locate and detect objects. Besides image classification, the object detection task can also benefit from the CNN model, as reported in the literature.

Generally speaking, three important reasons for the popularity of deep learning today are the drastically increased chip processing abilities (e.g., GPU units), the significantly lower cost of computing hardware, and recent advances in machine learning and signal/information processing research. Over the past several years, a rich family of deep learning techniques has been proposed and extensively studied, e.g., the Deep Belief Network (DBN), Boltzmann Machine (BM), Restricted Boltzmann Machine (RBM), Deep Boltzmann Machine (DBM), Deep Neural Network (DNN), etc. Among these techniques, the deep convolutional neural network, which is a discriminative deep architecture belonging to the DNN category, has achieved state-of-the-art performance on various tasks and competitions in computer vision and image recognition. Specifically, the CNN model consists of several convolutional layers and pooling layers, which are stacked up one on top of another. The convolutional layer shares many weights, and the pooling layer sub-samples the output of the convolutional layer and reduces the data rate from the layer below. The weight sharing in the convolutional layer, together with appropriately chosen pooling schemes, endows the CNN with some invariance properties (e.g., translation invariance).

My work is similar to the work of Ji Wan et al. [10] but differs in that the dataset I am using is different from the ones they used in their study. Also, my approach to image matching is completely novel and has not been used in any study similar to mine.
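As a concrete illustration of the stacked convolution/pooling structure described above, the following is a minimal sketch of a small CNN in PyTorch. It is not the architecture from any of the cited works; the layer sizes, input resolution and number of classes are arbitrary assumptions.

import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Minimal convolution + pooling stack followed by a classifier head."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # shared weights slide over the image
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling sub-samples the feature maps
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # assumes 32x32 RGB inputs

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

# Example: a batch of four 32x32 RGB images.
logits = SmallCNN()(torch.randn(4, 3, 32, 32))
print(logits.shape)  # torch.Size([4, 10])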

Numerai deep learning example

Distance metric learning (DML) is an important concept in image retrieval and has been studied very extensively in machine learning [[4], [7]]. In this section I will discuss some existing work on DML, which can be organized by different learning settings and principles. Most current DML studies work with two types of side information as training data: pairwise constraints, where must-link and cannot-link constraints are given, and triplet constraints, each consisting of a similar and a dissimilar pair. There are also studies which use class labels directly for DML by following a typical machine learning scheme, such as the large margin nearest neighbor (LMNN) algorithm; I have gone with the use of class labels directly for DML. With respect to learning techniques, distance metric learning can typically be categorized into two groups: the local supervised approach, where the metric is learned in a local sense so that the given local constraints from neighboring information are satisfied, and the global supervised approach, where all the constraints are satisfied simultaneously so that the metric is learned in a global setting. Most current DML studies use batch learning as the learning methodology, where the whole collection of training data must be given before the training task and a model is trained from scratch. The key concept on which distance metric learning is based is that, under an optimal metric, the distance between similar images should be minimized and the distance between dissimilar images maximized.
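As a rough illustration of this principle (pull similar pairs together, push dissimilar pairs apart), below is a minimal triplet-loss sketch in PyTorch. It is neither the LMNN algorithm nor the method used in this project; the toy embedding network and the margin are arbitrary assumptions.

import torch
import torch.nn as nn

embed = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))  # toy embedding net
loss_fn = nn.TripletMarginLoss(margin=1.0)  # margin is an arbitrary choice

anchor   = embed(torch.randn(16, 128))  # features of query images
positive = embed(torch.randn(16, 128))  # features of similar images
negative = embed(torch.randn(16, 128))  # features of dissimilar images

loss = loss_fn(anchor, positive, negative)  # small when d(a,p) + margin < d(a,n)
loss.backward()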

Numerai deep learning example

The image features are treated as words in order to apply the Bag of Words model to image classification [9]. A bag of visual words [13] in computer vision is defined as a vector of occurrence counts of a vocabulary of local image features [7]. In my project I used a dictionary of 40 words. To compute the key-points I first used SURF and then compared the results with SIFT, so as to be sure which worked better for this project. SURF is a robust local feature detector. It uses an integer approximation to the determinant-of-Hessian blob detector, which can be computed extremely quickly with an integral image (three integer operations). I set the Hessian threshold to 600. SIFT is an algorithm to detect and describe local features in images. The local image gradients are measured at the selected scale in the region around each key-point and are transformed into a representation that allows for significant levels of local shape distortion and change in illumination. The method proposed here used JSEG segmentation to segment the query image into regions. I then extracted color features to describe each region and SURF features from the entire image. The texture features were extracted using Gabor filters. After all features were extracted, I computed a bag of words using the SURF features of each region and combined it with the color and texture features to generate a feature vector. A random forest classifier was used to assign a class to each region, and then a similarity score was computed against every image in the dataset based on the current region labels; the images were ranked by this score. After ranking, the top n images (where n is the number of result images required by the user) were retrieved based on this ranking.
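The sketch below shows one way such a 40-word bag-of-visual-words vector could be built with OpenCV; it assumes an opencv-contrib build (SURF is non-free in some builds) and hypothetical image file names, and is not the exact pipeline used in this project.

import cv2
import numpy as np

surf = cv2.xfeatures2d.SURF_create(hessianThreshold=600)  # SURF with Hessian threshold 600
bow_trainer = cv2.BOWKMeansTrainer(40)                    # vocabulary ("dictionary") of 40 words

for path in ["img1.jpg", "img2.jpg"]:                     # hypothetical training images
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, descriptors = surf.detectAndCompute(gray, None)
    if descriptors is not None:
        bow_trainer.add(descriptors)

vocabulary = bow_trainer.cluster()                         # k-means over all SURF descriptors

# Histogram of visual-word occurrences for one image.
extractor = cv2.BOWImgDescriptorExtractor(surf, cv2.BFMatcher(cv2.NORM_L2))
extractor.setVocabulary(vocabulary)
gray = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
hist = extractor.compute(gray, surf.detect(gray))          # 1 x 40 bag-of-words vector
print(np.asarray(hist).shape)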

Numerai deep learning example

The images in our dataset contain annotations of different regions in the form of XML files. The Extensible Markup Language (XML) annotations provide the annotated image description of each image in the dataset, as shown in fig.3.1. With the help of the XML annotations we generate a mask which gives us the region masks of that image. The combination of region masks and XML annotations is used to generate descriptions of the image based on three main feature groups: color features, texture features, and a description of the image using key-points and bag of words. These annotated image descriptions are stored in an index in the form of a dictionary so that they can be accessed easily.
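A possible sketch of turning such annotations into region masks is given below; the XML tag names (object, name, polygon, pt) follow a LabelMe-style convention and, together with the image size, are assumptions rather than the actual schema of our dataset.

import xml.etree.ElementTree as ET
import numpy as np
import cv2

def region_masks(xml_path, height, width):
    """Build one binary mask per annotated region (assumed LabelMe-style XML)."""
    masks = {}
    root = ET.parse(xml_path).getroot()
    for obj in root.iter("object"):                       # one element per annotated region
        label = obj.findtext("name", default="unknown")
        pts = [(int(float(pt.findtext("x"))), int(float(pt.findtext("y"))))
               for pt in obj.iter("pt")]                  # polygon vertices
        mask = np.zeros((height, width), dtype=np.uint8)
        cv2.fillPoly(mask, [np.array(pts, dtype=np.int32)], 255)
        masks.setdefault(label, []).append(mask)
    return masks

# masks = region_masks("image1.xml", 480, 640)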

Numerai deep learning example

We use a combination of dominant color, average color, dominant channel and fuzzy color histogram to describe the color feature. The dominant color descriptor uses representative colors to characterize the color information in the required region of an image, making it a compact and efficient descriptor. Local features of an image can be well represented by a dominant color descriptor, which helps in fast and efficient retrieval of images from large datasets. The average color descriptor returns the average of all colors present in the region, which images can then be compared against. The dominant channel descriptor considers the dominant tone per channel and returns the percentage of the dominant channels. Fuzzy 3D color histograms are required to compute the dominant color; the fuzzy version is more balanced for colors that fall between color bins. We have used only 8 color bins in this project.
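The following is a simplified sketch of these color descriptors, using hard bin assignment (8 bins per channel) in place of a true fuzzy histogram; the helper name and the sample data are illustrative only.

import numpy as np

def color_descriptors(region_pixels, bins=8):
    """region_pixels: (N, 3) array of RGB values in [0, 255] for one region."""
    avg_color = region_pixels.mean(axis=0)                        # average color descriptor

    # 3-D color histogram with `bins` levels per channel (hard assignment).
    hist, _ = np.histogramdd(region_pixels, bins=(bins,) * 3,
                             range=((0, 256),) * 3)
    hist = hist.ravel() / hist.sum()

    dominant_bin = np.unravel_index(hist.argmax(), (bins,) * 3)    # most populated color bin
    dominant_color = (np.array(dominant_bin) + 0.5) * (256 / bins)

    dominant_channel = region_pixels.mean(axis=0).argmax()         # 0=R, 1=G, 2=B
    return avg_color, hist, dominant_color, dominant_channel

pixels = np.random.randint(0, 256, size=(1000, 3))
print(color_descriptors(pixels)[0])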

Numerai deep learning example

We used the Gabor filter as a texture feature descriptor. The Gabor filter is a linear filter used for edge detection; it is an image filter that can be used to describe the texture of an image. Gabor filters can be constructed at arbitrary sizes and orientations, which makes them good at detecting edge orientations in images. Their only drawback is that they are scale-sensitive. We also added the average and standard deviation of brightness for each region to complement the information provided by the Gabor filter.
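The sketch below shows one way a small Gabor filter bank could be applied with OpenCV, keeping the mean and standard deviation of each filter response together with the brightness statistics mentioned above; the kernel size and filter parameters are arbitrary assumptions.

import cv2
import numpy as np

def gabor_texture_features(gray, n_orientations=4, ksize=21):
    """Mean/std of Gabor responses at several orientations, plus brightness stats."""
    features = []
    for i in range(n_orientations):
        theta = i * np.pi / n_orientations
        kernel = cv2.getGaborKernel((ksize, ksize), sigma=4.0, theta=theta,
                                    lambd=10.0, gamma=0.5, psi=0)
        response = cv2.filter2D(gray.astype(np.float32), cv2.CV_32F, kernel)
        features += [response.mean(), response.std()]
    # Average and standard deviation of brightness, as mentioned above.
    features += [float(gray.mean()), float(gray.std())]
    return np.array(features)

gray = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
print(gabor_texture_features(gray).shape)  # (10,)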

Numerai deep learning example

The query image is an image supplied by the user as a sample for retrieving images from the dataset. The query image can be from any source and need not be from our dataset. The system takes the input query image and uses JSEG segmentation, which is explained in the next section, to segment the image. The segmented image is used to generate the region masks, and from these region masks feature extraction takes place, giving a feature vector as explained in fig.3.2. The feature vectors are passed into a region classifier (in our system a random forest classifier), which gives classified regions as shown in fig.3.4.

Numerai deep learning example

We needed segmentation of images for the retrieval part of the project. We used JSEG segmentation as it is considered one of the best segmentation algorithms around for segmenting color images; the reason is that it takes into consideration not only color but also texture while segmenting the image. JSEG segments images in an unsupervised manner based on color-texture regions, which involves performing color quantization and spatial segmentation independently. In the color quantization step, the regions in the image are differentiated by quantizing the colors in the image into several representative classes. A class map of the image is then formed by replacing the image pixels by their corresponding color class labels [3]. As shown in fig.3.1, the segmentation part is used more as a black-box method. Fig.3.3 shows the segmentation of a sample query image. For this project we used the already implemented version of the algorithm and made a script to process the images with the application.
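Since JSEG was used as an external, already-implemented application, the batch script can be as simple as the hypothetical sketch below; the executable name and its flags are assumptions and will differ for the actual JSEG release.

import subprocess
from pathlib import Path

# Hypothetical wrapper around an external JSEG segmentation binary.
for image_path in Path("dataset").glob("*.jpg"):
    out_path = image_path.with_suffix(".gif")        # JSEG writes segmented region maps
    subprocess.run(["./jseg_segment", "-i", str(image_path), "-o", str(out_path)],
                   check=True)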

Numerai deep learning example

Once the image is segmented using JSEG, we get the region masks based on the segmentation. These region masks are used for feature extraction so as to give a feature vector, on which we apply a region classifier (a random forest in this case) to give us classified regions as in fig.3.4. We have color-coded the 9 classes in the region classification for our convenience. In the image shown in fig.3.4, the dark blue region is classified as water, light blue is sky, brown is ground, dark gray is unknown and yellow is mountain. We store the region classification of all the images in our dataset and match them with the region classification of the query image, as described in the next section.
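A minimal sketch of this region classification step with scikit-learn is shown below; the feature dimensionality, the amount of training data and the class list shown are placeholders, not the actual values used in the project.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

CLASSES = ["water", "sky", "ground", "mountain", "unknown"]  # subset of the 9 region classes

# Hypothetical training data: one feature vector per annotated region.
X_train = np.random.rand(200, 50)            # 50-dimensional color+texture+BoW features
y_train = np.random.choice(len(CLASSES), 200)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

region_features = np.random.rand(4, 50)       # feature vectors for the query image's regions
print([CLASSES[c] for c in clf.predict(region_features)])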

Numerai deep learning example

We take a query image and segment it using the JSEG algorithm, which returns the segmented region masks as .gif files. As OpenCV cannot load .gif files, we use the Pygame library to load each mask as a 2D array so that it can be processed. Once we have the 2D array we separate out the region masks for each region in the image. We calculate the class percentage of each region in the query image and match it with the class percentages of all the other images in the database so as to generate a similarity score. This works using ”histogram intersection”, which means taking two histograms and choosing the minimum value in each bin; adding those values gives the similarity score. The images with the highest similarity scores are returned, up to the number of images the user requests.
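The histogram-intersection similarity described above amounts to summing the bin-wise minima of the two class-percentage histograms; a minimal NumPy sketch (with made-up numbers):

import numpy as np

def histogram_intersection(h1, h2):
    """Similarity score: sum of the minimum value in each bin."""
    return np.minimum(h1, h2).sum()

query_hist = np.array([0.5, 0.3, 0.2, 0.0])        # class percentages of the query image
db_hist    = np.array([0.4, 0.4, 0.1, 0.1])        # class percentages of a database image
print(histogram_intersection(query_hist, db_hist))  # 0.8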

Machine learning 3

There it was hinted, by means of a simple example, that in order to obtain a good representation of the process being modelled, one needs to estimate the model complexity, parameters and noise characteristics. In addition, it was mentioned that it is beneficial to incorporate a priori knowledge so as to mitigate the ill-conditioned nature of the learning problem. If we follow these specifications, we can almost assuredly obtain a model that generalises well.

This chapter will briefly review the classical approaches to learning and generalisation in the neural networks field. Aside from regularisation with noise and committees of estimators, most of the standard methods fall into two broadly overlapping categories: penalised likelihood and predictive assessment methods. Penalised likelihood methods involve placing a penalty term either on the model dimension or on the smoothness of the response (Hinton, 1987; Le Cun et al., 1990; Poggio and Girosi, 1990). Predictive assessment strategies, such as the cross-validation, jackknife or bootstrap methods (Ripley, 1996; Stone, 1974; Stone, 1978; Wahba and Wold, 1969), typically entail dividing the training data set into distinct subsets. The model is subsequently trained using all but one of the subsets and its performance is validated on the omitted subset. The procedure is repeated for each of the subsets. This predictive assessment is often used to set the penalty parameters in the penalised likelihood formulations.

These methods tend to lack a general and rigorous framework for incorporating a priori knowledge into the modelling process. Furthermore, they do not provide suitable foundations for the study of generalisation in sequential learning. To surmount these limitations, the Bayesian learning paradigm will be adopted in this thesis. This approach will allow us to incorporate a priori knowledge into the modelling process and to compute, jointly and within a probabilistic framework, the model parameters, noise characteristics, model structure and regularisation coefficients. It will also allow us to do this sequentially.
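As a concrete illustration of the predictive assessment idea (train on all but one subset, validate on the omitted one, repeat, and use the result to set a penalty parameter), here is a small scikit-learn cross-validation sketch; the model, data and penalty values are arbitrary placeholders.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X = np.random.rand(100, 5)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * np.random.randn(100)

# K-fold predictive assessment: each fold is held out once while the model is
# trained on the remaining folds; here it is used to compare penalty parameters.
for alpha in [0.01, 1.0, 100.0]:
    scores = cross_val_score(Ridge(alpha=alpha), X, y,
                             cv=KFold(n_splits=5, shuffle=True, random_state=0))
    print(alpha, scores.mean())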

Machine learning 2

Many physical processes may be described by the following nonlinear, multivariate mapping from an input vector x to an output (target) vector y, corrupted by additive noise:

y = f(x) + v     (1.1)

Depending on how the data is gathered, we can identify two types of learning: batch learning and sequential learning. In the context of batch learning, the learning problem involves computing an approximation to the function f and estimating the characteristics of the noise process given a set of input-output observations. In contrast, in the sequential learning scenario, the observations arrive one at a time. Typical instances of the learning problem include regression, where the output is continuous; classification, where the output corresponds to a discrete group of classes; and nonlinear dynamical system identification, where the inputs and targets correspond to several delayed versions of the signals under consideration.

The disturbances v may represent both measurement noise and unknown inputs. This study will assume that they can be added directly to the output. The basis for this assumption is that noise in the input, together with other system disturbances, will propagate through the system and can therefore be lumped into a single measurement noise term (Ljung, 1987). When introducing sequential Monte Carlo methods in Chapter 6, this assumption will be weakened by adopting a more general formulation. In some scenarios, one might, however, be interested in modelling the distribution of the input data (Cornford et al., 1998; Wright, 1998). This topic, however, lies beyond the scope of this thesis.

The goal of learning, as posed here, is to obtain a description of the conditional distribution p(y | x). As the dimension of this distribution can be very large, it is convenient to adopt a variational approach and project it into a lower dimensional space. This can be accomplished by introducing a set of parameters θ, leading to the distribution p(y | x, θ). For example, if we believe that the data has been generated by a Gaussian distribution, we only need two sufficient statistics to describe it, namely its mean and covariance. These statistics can, in turn, be described by a low-dimensional set of parameters. These parameters will allow us to infer the outputs y whenever we observe new values of the inputs x.
The regression function f is a multivariate, nonlinear and possibly time-varying mapping. (A note on notation: a subscripted quantity is used to denote all the observations corresponding to a given output; if one index does not appear, it is implied that we are referring to all of its possible values. The shorter notation will be favoured; the longer notation will only be invoked to avoid ambiguities and to emphasise certain dependencies.) When the exact nonlinear structure of this mapping cannot be established a priori, it may be synthesised as a combination of parametrised basis functions. That is:

f(x) = \sum_{j=1}^{k} \beta_j \phi_j(x)     (1.2)

where \phi_j(x) denotes a multivariate basis function. These multivariate basis
functions may be generated from univariate basis functions using radial basis, tensor product or ridge construction methods. This type of modelling is often referred to
as “non-parametric” regression because the number of basis functions is typically very
large. Equation (1.2) encompasses a large number of nonlinear estimation methods
including projection pursuit regression (Friedman and Stuetzle, 1981; Huber, 1985),
Volterra series (Billings, 1980; Mathews, 1991), fuzzy inference systems (Jang and Sun,
1993), generalised linear models (Nelder and Wedderburn, 1972), multivariate adaptive regression splines (MARS) (Denison, 1998; Friedman, 1991) and many artificial
neural network paradigms such as functional link networks (Pao, 1989), multi-layer
perceptrons (MLPs) (Rosenblatt, 1959; Rumelhart et al., 1986), radial basis function
networks (RBFs) (Lowe, 1989; Moody and Darken, 1988; Poggio and Girosi, 1990),
wavelet networks (Bakshi and Stephanopoulos, 1993; Juditsky et al., 1995) and hinging hyper-planes (Breiman, 1993). For an introduction to neural networks, the reader
may consult any of the following books (Bishop, 1995b; Haykin, 1994; Hecht-Nielsen,
1990; Ripley, 1996).
Neural networks can approximate any continuous function arbitrarily well as the
number of neurons (basis functions) increases without bound (Cybenko, 1989; Hornik
et al., 1989; Poggio and Girosi, 1990). In addition, they have been successfully applied
to many complex problems, including speech recognition (Robinson, 1994), hand written digit recognition (Le Cun et al., 1989), financial modelling (Refenes, 1995) and
medical diagnosis (Baxt, 1990) among others. This thesis will consider two types of
neural network architectures: fixed dimension MLPs and variable dimension RBFs.
MLPs have enjoyed a privileged position in the neural networks community because
of their simplicity, approximating power, relation to biological systems and various historical reasons. Figure 1.2 shows a typical two hidden layer MLP with logistic sigmoid
basis functions in the hidden layers and a single output linear neuron. Networks of this
type can be represented mathematically as follows:

\hat{y}(x) = w^{(3)\top} \sigma( W^{(2)} \sigma( W^{(1)} x + b^{(1)} ) + b^{(2)} ) + b^{(3)}

where the j-th row of W^{(1)} (and, analogously, of W^{(2)}) is the vector containing the weights connecting each input with the j-th neuron. The logistic sigmoid function is given by:

\sigma(a) = 1 / (1 + e^{-a})

Figure 1.2 Typical multi-layer perceptron architecture.

If our goal is to perform classification, then it is convenient to employ a logistic function in the output layer. This allows us to interpret the outputs of the network as
probabilities of class membership. Although the MLPs discussed in this thesis exhibit
a feed-forward architecture, they can be easily extended to recurrent schemes by the
addition of multiple feedback connections or tapped delay lines (de Freitas et al., 1996;
Narendra and Parthasarathy, 1990; Puskorius and Feldkamp, 1994; Qin et al., 1992;
Sjöberg, 1995; Yamada and Yabuta, 1993).
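A minimal NumPy sketch of the two-hidden-layer MLP just described, with logistic sigmoid hidden units and a single linear output neuron; the layer widths and weights are random placeholders.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(10, 4)), np.zeros(10)   # first hidden layer (4 inputs, 10 neurons)
W2, b2 = rng.normal(size=(6, 10)), np.zeros(6)    # second hidden layer
w3, b3 = rng.normal(size=6), 0.0                  # single linear output neuron

def mlp(x):
    h1 = sigmoid(W1 @ x + b1)
    h2 = sigmoid(W2 @ h1 + b2)
    return w3 @ h2 + b3

print(mlp(np.array([0.1, -0.3, 0.7, 0.2])))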
RBF networks tend to be more tractable than MLPs. In these models, the training
of the parameters corresponding to different layers is, to a large extent, decoupled.
Chapters 5 and 7 will discuss an approximation scheme consisting of a mixture of k RBFs and a linear regression term (Holmes and Mallick, 1998). The number of basis functions will be estimated from the data. Thus, unless the data is nonlinear, the model collapses to a standard linear model. More precisely, the linear-RBF model is given by:

y(x) = \sum_{j=1}^{k} a_j \phi(\| x - \mu_j \|) + \beta^\top x + \beta_0

where, in this case, \| \cdot \| denotes a distance metric (usually Euclidean or Mahalanobis), \mu_j denotes the j-th RBF centre for a model with k RBFs, a_j denotes the j-th RBF amplitude, and \beta_0 and \beta denote the linear regression parameters. Figure 1.3 depicts the approximation model for three basis functions, two inputs and two outputs.

Figure 1.3 Linear-RBF approximation model with three radial basis functions, two inputs and two outputs. The solid lines indicate weighted connections.

Depending on our a priori knowledge about the smoothness of the mapping, we can choose different types of basis functions (Girosi et al., 1995). The most common choices are linear, cubic, thin-plate spline, multiquadric and Gaussian basis functions. For the last two choices of basis functions, the width parameter will be treated as a user-set parameter.
Nevertheless, the Monte Carlo estimation strategies described in Chapters 5, 6 and 7
can treat the choice of basis functions as a model selection problem. It is possible to
place a prior distribution on the basis functions and allow the Monte Carlo algorithms
to decide which of them provide a better solution.
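A small NumPy sketch of the linear-RBF model just described, with Gaussian basis functions and Euclidean distances; the centres, amplitudes, regression coefficients and width are arbitrary placeholders.

import numpy as np

def linear_rbf(x, centres, amplitudes, beta, beta0, width=1.0):
    """y(x) = sum_j a_j * phi(||x - mu_j||) + beta' x + beta0, with Gaussian phi."""
    r = np.linalg.norm(centres - x, axis=1)            # Euclidean distances to each centre
    phi = np.exp(-width * r ** 2)                       # Gaussian basis functions
    return amplitudes @ phi + beta @ x + beta0

centres = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])   # three RBF centres, two inputs
amplitudes = np.array([0.5, -1.0, 2.0])
print(linear_rbf(np.array([0.5, 0.5]), centres, amplitudes,
                 beta=np.array([0.1, 0.2]), beta0=0.3))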

1.2 Scope and Contributions of this Thesis
In the example presented earlier, it was shown that the ability of a model to predict accurately with novel data depends on the amount of data, the complexity of the model
and the noise in the data. It was then argued that artificial neural networks provide a
general and flexible nonlinear modelling strategy. From this standpoint, the learning
problem involves estimating the neural network’s parameters, the number of parameters, the type of basis functions and the statistics of the noise. In addition, we might
have to select the most appropriate set of input signals.
A great deal of effort has been devoted to the solution of the parameter estimation
problem. The other problems have received less attention. In contrast, the issues of
noise estimation and model selection will be central to the scope of this thesis. It will be
possible to manage these more demanding tasks by embracing the Bayesian learning
paradigm. Despite the fact that the problems of input variable selection and basis
function selection are not treated explicitly, the solution to these is a natural extension
of the model selection frameworks presented in Chapters 6 and 7.
Another important theme in this thesis is the issue of sequential learning and inference. Sequential training methods for neural networks are important in many applications involving real-time signal processing, where data arrival is inherently sequential.
Furthermore, one might wish to adopt a sequential processing strategy to deal with
non-stationarity in signals, so that information from the recent past is given greater
weight than information from the distant past. Computational simplicity in the form
of not having to store all the data might also constitute an additional motivating factor
for sequential methods.
This thesis proposes the following:

- A novel approach to perform regularisation in sequential learning. This approach establishes theoretical links between extended Kalman filters with adaptive noise estimation, gradient descent methods with multiple adaptive learning rates and training methods with multiple smoothing regularisation coefficients.

- An expectation maximisation (EM) algorithm to estimate the parameters of an MLP, the noise statistics and the model uncertainty jointly. The method is applicable to non-stationary parameter spaces.

- A robust Bayesian method to estimate, jointly, the parameters, number of parameters, noise statistics and signal-to-noise ratios of an RBF network. The necessary computations are performed using a reversible jump Markov chain Monte Carlo (MCMC) simulation method. In addition, it presents an efficient reversible jump MCMC simulated annealing strategy to perform global optimisation of RBF networks. Furthermore, it proves the convergence of these algorithms rigorously. (These contributions were strongly motivated by the work of Christophe Andrieu and Arnaud Doucet (Andrieu and Doucet, 1998b; Andrieu and Doucet, 1998a; Andrieu and Doucet, 1999).)

- The application of particle filters and sequential Monte Carlo (SMC) methods to the neural networks field. In doing so, new SMC algorithms are devised to deal with the high-dimensional parameter spaces inherent to neural network models. These algorithms are suitable for nonlinear, non-Gaussian and non-stationary modelling.

- A new and general sequential Monte Carlo approach to perform sequential noise estimation and model selection. The method is demonstrated on RBF networks.
1.3 Thesis Organisation
The chapters are, to a large extent, self contained and can be read independently. Chapter 2 is of an introductory nature. At the end of each of the main chapters, Chapters
3 to 7, the proposed methods and algorithms are demonstrated on experiments with
synthetic data. In Chapter 8, further tests on a few real problems are presented. A
summary of the thesis follows:
Chapter 2: Learning and Generalisation
This chapter provides a brief review of learning theory from a neural networks perspective. It addresses both the classical and Bayesian approaches. In addition, it introduces
the sequential learning problem.
Chapter 3: Sequential Bayesian Learning with Gaussian Approximations
Sequential learning methods, in particular Gaussian approximation schemes, are introduced in this chapter. It is shown that an hierarchical Bayesian modelling approach
enables one to perform regularisation in sequential learning. Three inference levels
are identified within this hierarchy, namely model selection, parameter estimation and
noise estimation. In environments where data arrives sequentially, techniques such
as cross-validation to achieve regularisation or model selection are not possible. The
Bayesian approach, with extended Kalman filtering at the parameter estimation level,
allows for regularisation within a minimum variance framework. A multi-layer perceptron is used to generate the extended Kalman filter nonlinear measurements mapping.
Several algorithms are described at the noise estimation level, thus permitting the implementation of on-line regularisation. Another contribution of this chapter is to show
the theoretical links between adaptive noise estimation in extended Kalman filtering,
multiple adaptive learning rates and multiple smoothing regularisation coefficients.
Chapter 4: Dynamic Batch Learning with the EM Algorithm
This chapter extends the sequential Gaussian approximation framework discussed in
the previous chapter to the batch learning scenario. In it, an EM algorithm for nonlinear
state space models is derived. It is used to estimate jointly the neural network weights,
the model uncertainty and the noise in the data. In the E-step, a forward-backward
Rauch-Tung-Striebel smoother is adopted to compute the network weights. For the
M-step, analytical expressions are derived to compute the model uncertainty and the
measurement noise. The method is shown to be intrinsically very powerful, simple and
stable.
Chapter 5: Robust Full Bayesian Learning with MCMC
This chapter begins the presentation of Monte Carlo methods, a major theme in this
thesis. The reversible jump MCMC simulation algorithm is applied to RBF networks, so
as to compute the joint posterior distribution of the radial basis centres and the number
of basis functions. This area of research is advanced in three important directions.
First, a robust prior for RBF networks is proposed. That is, the results do not depend
on any heuristics or thresholds. Second, an automated growing and pruning reversible
jump MCMC optimisation algorithm is designed to choose the model order according
to classical AIC, BIC and MDL criteria. This MCMC algorithm estimates the maximum
of the joint likelihood function of the radial basis centres and the number of bases using
simulated annealing. Finally, some geometric convergence theorems for the proposed
algorithms are presented.
Chapter 6: Sequential Monte Carlo Methods
Here, a novel strategy for training neural networks using sequential Monte Carlo (SMC)
algorithms is discussed. Various hybrid gradient descent/sampling importance resampling algorithms are proposed. In terms of both modelling flexibility and accuracy, SMC
algorithms provide a clear improvement over conventional Gaussian schemes. These
algorithms may be viewed as a global learning strategy to learn the probability distributions of the network weights and outputs in a sequential framework. They are also well
suited to applications involving on-line, nonlinear and non-Gaussian signal processing.

Chapter 7: Sequential Bayesian Model Selection
This chapter extends the model selection strategy discussed in Chapter 5 to the sequential learning case. This problem does not usually admit any type of closed-form
analytical solutions and, as a result, one has to resort to numerical methods. The chapter proposes an original sequential simulation-based strategy to perform the necessary
computations. It combines sequential importance sampling, a selection procedure and
reversible jump MCMC moves. The effectiveness of the method is demonstrated by
applying it to radial basis function networks.
Chapter 8: Applications
This chapter demonstrates the performance of the various methods on some interesting real data sets. It includes comprehensive comparisons between the proposed algorithms.
Chapter 9: Conclusions
This final chapter summarises the theoretical and experimental results. It discusses
their relevance and suggests a few directions for further research.

Machine learning 1

Models are abstractions of reality to which experiments can be applied to improve
our understanding of phenomena in the world. They are at the heart of science and
permeate throughout most disciplines of human endeavour, including economics, engineering, medicine, politics, sociology and data management in general. Models can
be used to process data to predict future events or to organise data in ways that allow
information to be extracted from it.
There are two common approaches to constructing models. The first is of a deductive nature. It relies on subdividing the system being modelled into subsystems that
can be expressed by accepted relationships and physical laws. These subsystems are
typically arranged in the form of simulation blocks and sets of differential equations.
The model is consequently obtained by combining all the sub-models.
The second approach favours the inductive strategy of estimating models from measured data. This estimation process will be referred to as “learning from data” or simply “learning” for short. In this context, learning implies finding patterns in the data
or obtaining a parsimonious representation of data that can then be used for several
purposes such as forecasting or classification. Learning is of paramount importance
in nonlinear modelling due to the lack of a coherent and comprehensive theory for
nonlinear systems.
Learning from data is an ill-posed problem. That is, there is a continuum of solutions for any particular data set. Consequently, certain restrictions have to be imposed
on the form of the solution. Often
a priori knowledge about the phenomenon being
modelled is available in the form of physical laws, generally accepted heuristics or
mathematical relations. This knowledge should be incorporated into the modelling
process so as to reduce the set of possible solutions to one that provides reasonable results. The ill-posed nature and other inherent difficulties associated with the problem
of learning can be clearly illustrated by means of a simple noisy interpolation example.
Consider the data plotted in Figure 1.1-A. It has been generated by the following
equation:

y = f(x) + v

where f represents the true function between the input x and the output y, and v
denotes zero mean uniformly distributed noise. The learning task is to use the noisy
data points plotted in Figure 1.1-A to estimate the true relation between the input and
the output.
We could attempt to model the data by polynomials of different order fitted to the
data by conventional least squares techniques. Let us assume that we try to fit second
and sixth order polynomials to the data. As shown in Figure 1.1-B, the 6th order
polynomial approximates the data better than the second order polynomial. However,
if we plot the true function and the two estimators as in Figure 1.1-C, we find that the
second order estimator provides a better approximation to the true function. Moreover,
the second order estimator provides a far better approximation to the true function for
novel (extrapolation) data, as depicted in Figure 1.1-D.
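The comparison described above can be reproduced in a few lines; the true function and noise level below are arbitrary assumptions, not the ones used to generate Figure 1.1.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = np.sin(2 * np.pi * x) + rng.uniform(-0.3, 0.3, size=x.shape)  # assumed true function + noise

p2 = np.polyfit(x, y, 2)   # second order polynomial (least squares fit)
p6 = np.polyfit(x, y, 6)   # sixth order polynomial

for deg, p in [(2, p2), (6, p6)]:
    train_err = np.mean((np.polyval(p, x) - y) ** 2)
    print(f"degree {deg}: training MSE = {train_err:.4f}")
# The higher-order fit typically has the lower training error, yet generalises
# worse on inputs outside [-1, 1] (extrapolation), as discussed in the text.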
In conclusion, very complex estimators will approximate the training data points
better but may be worse estimators of the true function. Consequently, their predictions
for samples not encountered in the training data set may be worse than the predictions
produced by lower complexity estimators. The ability to predict well with samples
not encountered in the training data set is usually referred to as generalisation in the
machine learning literature. Note that if we had known the attributes of the noise
term
a priori, we could have inferred that the 6th order polynomial was fitting it.
Alternatively, if we had had any data in the extrapolation interval, we would have noticed
the problems associated with using the 6th order polynomial. The last two remarks
indicate, clearly, that
a priori knowledge and the size and scope of the data set play a
significant role in learning.
The previous simple example has unveiled several of the difficulties that arise when
we try to infer models from noisy data, namely:
- The learning problem is ill-posed. It contains infinitely many solutions.
- Noise and limited training data pose limitations on the generalisation performance of the estimated models.
- We have to select a set of nonlinear model structures with enough capacity to approximate the true function.
- We need techniques for fitting the selected models to the data.