Deep learning refers to a class of machine learning techniques in which many layers of information processing stages, arranged in hierarchical architectures, are exploited for pattern classification and for feature or representation learning. It lies at the intersection of several research areas, including neural networks, graphical modeling, optimization, pattern recognition, and signal processing. Yann LeCun adopted the deep supervised backpropagation convolutional network for digit recognition. In the recent past, deep learning has become a valuable research topic in both computer vision and machine learning, where it achieves state-of-the-art results on a variety of tasks. The deep convolutional neural network (CNN) proposed by Hinton and colleagues in "ImageNet Classification with Deep Convolutional Neural Networks" came first in the ImageNet image classification task. The model was trained on more than one million images and achieved a winning top-5 test error rate of 15.3% over 1,000 classes. After that, later works obtained better results by improving CNN models. The top-5 test error rate decreased to 13.24% when the model was trained to simultaneously classify, locate, and detect objects. Besides image classification, the object detection task can also benefit from the CNN model. Generally speaking, three important reasons for the popularity of deep learning today are drastically increased chip processing abilities (e.g., GPUs), the significantly lower cost of computing hardware, and recent advances in machine learning and signal/information processing research. Over the past several years, a rich family of deep learning techniques has been proposed and extensively studied, e.g., Deep Belief Networks (DBN), Boltzmann Machines (BM), Restricted Boltzmann Machines (RBM), Deep Boltzmann Machines (DBM), and Deep Neural Networks (DNN).
Among these techniques, the deep convolutional neural network, which is a discriminative deep architecture and belongs to the DNN category, has achieved state-of-the-art performance on various tasks and competitions in computer vision and image recognition. Specifically, the CNN model consists of several convolutional layers and pooling layers stacked one on top of another. The convolutional layer shares many weights, and the pooling layer sub-samples the output of the convolutional layer and reduces the data rate from the layer below. The weight sharing in the convolutional layer, together with appropriately chosen pooling schemes, endows the CNN with some invariance properties (e.g., translation invariance). My work is similar to that of Ji Wan et al., but it differs in that the dataset I use is different from the ones used in their study. My approach to image matching is also novel and has not been used in any similar study.
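The convolution and pooling steps described above can be illustrated with a minimal pure-Python sketch. It is not the architecture used in any of the cited works: the function names `conv2d_valid` and `max_pool`, the toy 5x5 images, and the 2x2 diagonal kernel are all assumptions chosen for illustration. It shows one shared kernel being slid over the whole image, and a pattern shifted by one pixel still producing its peak response in the same pooled cell.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over a 2D list (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def diagonal_image(top, left, side=5):
    """Hypothetical toy image: a two-pixel diagonal starting at (top, left)."""
    image = [[0] * side for _ in range(side)]
    image[top][left] = 1
    image[top + 1][left + 1] = 1
    return image


kernel = [[1, 0], [0, 1]]  # responds most strongly to a two-pixel diagonal
pooled_a = max_pool(conv2d_valid(diagonal_image(0, 0), kernel))
pooled_b = max_pool(conv2d_valid(diagonal_image(1, 1), kernel))
print(pooled_a[0][0], pooled_b[0][0])  # peak response of each image in the first pooled cell
```

The invariance is only partial: a larger shift moves the peak into a neighboring pooled cell, which is why the text speaks of "some" invariance.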
Distance metric learning (DML) is an important concept in image retrieval and has been studied extensively in machine learning. In this section I discuss existing work on DML, which can be organized by learning setting and principle. With respect to training data format, most current DML studies work with two types of data or side information: pairwise constraints, where must-link and cannot-link constraints are given, and triplet constraints, each of which consists of a similar pair and a dissimilar pair. Some studies use class labels directly for DML by following a typical machine learning scheme, such as the large margin nearest neighbor (LMNN) algorithm. I have chosen to use class labels directly for DML. With respect to learning technique, distance metric learning can typically be categorized into two groups: the local supervised approach, where the metric is learned in a local sense so that the local constraints given by neighboring information are satisfied, and the global supervised approach, where the metric is learned in a global setting so that all constraints are satisfied simultaneously. Most current DML studies use batch learning, in which the whole collection of training data must be given before training and the model is trained from scratch. The key idea behind distance metric learning is that, under an optimal metric, the distance between similar images is minimized and the distance between dissimilar images is maximized.
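The pairwise objective described above can be sketched under one common parameterization, which is an assumption here rather than something stated in the text: the metric is a Mahalanobis distance induced by a linear map L, and the loss is a contrastive-style sum over must-link and cannot-link pairs. The names `mahalanobis`, `pairwise_loss`, and the `margin` parameter are illustrative.

```python
import math


def mahalanobis(x, y, L):
    """Distance after the linear map L: sqrt((x - y)^T L^T L (x - y))."""
    diff = [a - b for a, b in zip(x, y)]
    projected = [sum(row[k] * diff[k] for k in range(len(diff))) for row in L]
    return math.sqrt(sum(p * p for p in projected))


def pairwise_loss(pairs, L, margin=1.0):
    """Penalize distant must-link pairs and cannot-link pairs closer than the margin.

    pairs: iterable of (x, y, similar), where similar is True for must-link.
    """
    total = 0.0
    for x, y, similar in pairs:
        d = mahalanobis(x, y, L)
        total += d * d if similar else max(0.0, margin - d) ** 2
    return total


identity = [[1.0, 0.0], [0.0, 1.0]]
pairs = [
    ([0.0, 0.0], [3.0, 4.0], True),   # must-link pair that is far apart
    ([0.0, 0.0], [0.5, 0.0], False),  # cannot-link pair that is too close
]
print(pairwise_loss(pairs, identity))
```

A learning procedure would adjust the entries of L to reduce this loss; batch methods do so using the whole training set at once.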
Image features are treated as words in order to apply the Bag of Words model to image classification. In computer vision, a bag of visual words is defined as a vector of occurrence counts over a vocabulary of local image features. In my project I used a dictionary of 40 words. To compute the key-points I first used SURF and then compared the results with SIFT to determine which worked better for the project. SURF is a robust local feature detector. It uses an integer approximation to the determinant of Hessian blob detector, which can be computed extremely quickly with an integral image (3 integer operations). I set the Hessian threshold to 600. SIFT is an algorithm to detect and describe local features in images. The local image gradients are measured at the selected scale in the region around each key-point. These are transformed into a representation that allows for significant levels of local shape distortion and change in illumination. The method I propose uses JSEG segmentation to segment the query image into regions. I then extracted color features to describe each region and SURF features from the entire image. The texture features were extracted using Gabor filters. After all features were extracted, I computed a bag of words from the SURF features of each region and combined it with the color and texture features to generate the feature vector. A random forest classifier was used to assign a class to each region. A similarity score was then computed against every image in the dataset based on the current region labels, and the images were ranked by this score. After ranking, the top n images (where n is the number of result images requested by the user) were retrieved.
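The bag-of-visual-words step can be sketched as follows. This is a minimal illustration, not the project's code: toy 2-D descriptors and a 3-word vocabulary stand in for SURF descriptors and the 40-word dictionary, and the helper names `nearest_word` and `bag_of_words` are assumptions. Each descriptor votes for its nearest vocabulary entry, and the votes form the occurrence-count vector.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary entry closest to the descriptor (squared Euclidean distance)."""
    def sq_dist(word):
        return sum((d - w) ** 2 for d, w in zip(descriptor, word))
    return min(range(len(vocabulary)), key=lambda i: sq_dist(vocabulary[i]))


def bag_of_words(descriptors, vocabulary):
    """Occurrence counts of visual words for one image or region."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    return counts


# Hypothetical toy data: in practice the vocabulary would come from clustering
# descriptors collected over the training images.
vocabulary = [[0, 0], [10, 0], [0, 10]]
descriptors = [[1, 1], [9, 1], [0, 9], [0.5, 0]]
print(bag_of_words(descriptors, vocabulary))
```

The resulting vector has one entry per vocabulary word regardless of how many key-points the region contains, which is what allows it to be concatenated with the color and texture features.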
The images in our dataset come with annotations of their different regions in the form of XML files. The Extensible Markup Language (XML) annotations provide the annotated description of each image in the dataset, as shown in fig.3.1. From the XML annotations we generate the region masks of each image. The combination of region masks and XML annotations is used to generate descriptions of the image based on 3 main features: color features, texture features, and a description of the image using key-points and Bag of Words. These annotated image descriptions are stored in an index in the form of a dictionary so that they can be accessed easily.
We use a combination of dominant color, average color, dominant channel, and fuzzy color histogram to describe the color feature. The dominant color descriptor uses representative colors to characterize the color information in the required region of an image, which makes it a compact and efficient descriptor. Local features of an image are well represented by a dominant color descriptor, which helps in fast and efficient retrieval of images from large datasets. The average color descriptor returns the average of all colors present in the image, which serves as the basis for comparison. The dominant channel descriptor considers the dominant tone per channel and returns the percentage of the dominant channels. Fuzzy 3D color histograms are required to compute the dominant color. Fuzzy versions are better balanced for colors that fall between color bins. We have used only 8 color bins in this project.
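The idea behind the fuzzy histogram can be sketched in one dimension. This is a simplification made for illustration: the project uses 3D color histograms, whereas the sketch below handles a single channel with 8 bins and a linear (triangular) membership function, which is an assumed choice. The function names are hypothetical. A value that falls between two bin centers contributes to both bins in proportion to its closeness, instead of being forced into one.

```python
import math


def average_color(pixels):
    """Per-channel mean over a non-empty list of (r, g, b) pixels."""
    n = len(pixels)
    return tuple(sum(p[ch] for p in pixels) / n for ch in range(3))


def fuzzy_histogram(values, bins=8, max_value=256):
    """Normalized 1D histogram in which each value is shared between its two nearest bin centers."""
    width = max_value / bins
    hist = [0.0] * bins
    for v in values:
        pos = v / width - 0.5            # position measured in units of bin centers
        lower = math.floor(pos)
        frac = pos - lower               # closeness to the upper neighbor
        lo = min(max(lower, 0), bins - 1)
        hi = min(max(lower + 1, 0), bins - 1)
        hist[lo] += 1.0 - frac
        hist[hi] += frac
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def dominant_bin(hist):
    """Index of the bin with the largest weight."""
    return max(range(len(hist)), key=hist.__getitem__)


# A value exactly on the border between bins 0 and 1 is split evenly.
print(fuzzy_histogram([32])[:2])
```

With a crisp histogram the value 32 would land entirely in bin 1, so two nearly identical colors on either side of the border would look maximally different; the fuzzy version avoids that discontinuity.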
We used the Gabor filter as a texture feature descriptor. The Gabor filter is a linear filter used for edge detection, and it can also be used to describe the texture of an image. Gabor filters can be constructed at arbitrary sizes and orientations and are well suited to detecting edge orientations in images. Their only drawback is that they are scale-sensitive. We also added the average and standard deviation of brightness for each region to complement the information provided by the Gabor filter.
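For reference, the real part of a Gabor kernel can be built directly from its standard definition, a Gaussian envelope multiplied by a cosine carrier along the rotated axis. The sketch below is illustrative only: the function name `gabor_kernel` and the parameter values in the usage line are assumptions, not the settings used in the project.

```python
import math


def gabor_kernel(size, sigma, theta, wavelength, gamma=0.5, psi=0.0):
    """Real Gabor kernel as a size x size list of lists (size must be odd).

    sigma: width of the Gaussian envelope
    theta: orientation in radians
    wavelength: wavelength of the cosine carrier
    gamma: spatial aspect ratio of the envelope
    psi: phase offset of the carrier
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so the kernel has a center pixel")
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the carrier oscillates along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


# A small bank covering four orientations; filtering a region with each kernel
# and summarizing the responses gives a texture descriptor.
bank = [gabor_kernel(5, 2.0, k * math.pi / 4, 4.0) for k in range(4)]
```

The scale sensitivity mentioned above comes from `sigma` and `wavelength`: a kernel tuned to one texture scale responds weakly to the same texture at a different scale, which is why filter banks usually span several wavelengths as well as orientations.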
The query image is an image supplied by the user as a sample for retrieving images from the dataset. The query image can be from any source and need not be from our dataset. The system takes the input query image and segments it using JSEG segmentation, which is explained in the next section. The segmented image is used to generate the region masks, and feature extraction on these region masks gives a feature vector, as explained in fig.3.2. The feature vectors are passed into a region classifier (in our system, a random forest classifier), which gives the classified regions shown in fig.3.4.
We needed segmentation of images for the retrieval part of the project. We used JSEG segmentation because it is considered to be one of the best available algorithms for segmenting color images. The reason for this is that it takes into consideration not only color but also texture while segmenting the image. JSEG segments images in an unsupervised manner based on color-texture regions, performing color quantization and spatial segmentation independently. In the color quantization step, the regions in the image are differentiated by quantizing the colors in the image to several representative classes. A class map of the image is then formed by replacing the image pixels with their corresponding color class labels. As shown in fig.3.1, the segmentation part is more of a black box method. Fig.3.3 shows the segmentation of a sample query image. For this project we used an existing implementation of the algorithm and wrote a script to process the images with the application.
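The class-map step described above can be illustrated with a small sketch. It assumes the representative colors are already available as a palette and says nothing about how JSEG chooses them; the function name `class_map` and the toy palette are hypothetical. Each pixel is replaced by the index of its nearest representative color.

```python
def class_map(image, palette):
    """Replace every (r, g, b) pixel with the index of its nearest palette color."""
    def nearest(pixel):
        return min(
            range(len(palette)),
            key=lambda i: sum((p - q) ** 2 for p, q in zip(pixel, palette[i])),
        )
    return [[nearest(pixel) for pixel in row] for row in image]


# Hypothetical palette: black, white, blue.
palette = [(0, 0, 0), (255, 255, 255), (0, 0, 255)]
image = [
    [(10, 10, 10), (250, 250, 250)],
    [(5, 5, 200), (0, 0, 0)],
]
print(class_map(image, palette))
```

The spatial segmentation stage then operates on this map of labels rather than on the raw colors.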
Once the image is segmented using JSEG, we obtain the region masks from the segmentation. These region masks are used for feature extraction, which gives a feature vector to which we apply a region classifier (Random Forest in this case) to obtain classified regions like those in fig.3.4. We have color coded the 9 classes in the region classification for convenience. In the image shown in fig.3.4, the dark blue region is classified as water, light blue as sky, brown as ground, dark gray as unknown, and yellow as mountain. We store the region classification of all the images in our dataset and match them with the region classification of the query image, as described in the next section.
We take a query image and segment it using the JSEG algorithm, which returns the segmented region mask as a .gif file. Because OpenCV cannot load .gif files, we use the Pygame library to load the mask as a 2D array so that it can be processed. Once we have the 2D array, we separate out the region mask for each region in the image. We calculate the class percentage of each region in the query image and match it against the class percentages of all the other images in the database to generate a similarity score. The matching uses "Histogram Intersection", which means taking two histograms and choosing the minimum value in each bin. These minimum values are then added, and the result is the similarity score. The images with the highest similarity scores are returned, up to the number of images requested.
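The matching step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the class map is taken to be a 2D list of integer class labels, three classes are used instead of nine, and the function names and image names are hypothetical.

```python
from collections import Counter


def class_percentages(label_map, num_classes):
    """Fraction of pixels assigned to each class in a 2D map of class labels."""
    counts = Counter(label for row in label_map for label in row)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * num_classes
    return [counts.get(c, 0) / total for c in range(num_classes)]


def histogram_intersection(h1, h2):
    """Sum of the bin-wise minima of two histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def rank_images(query_hist, dataset_hists, top_n):
    """Names of the top_n dataset images, ordered by decreasing similarity to the query."""
    scored = sorted(
        dataset_hists.items(),
        key=lambda item: histogram_intersection(query_hist, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:top_n]]


query = class_percentages([[0, 0], [1, 2]], num_classes=3)
dataset = {
    "a": [0.5, 0.25, 0.25],
    "b": [0.0, 1.0, 0.0],
    "c": [0.5, 0.5, 0.0],
}
print(rank_images(query, dataset, top_n=2))
```

Because the class percentages of each image sum to 1, the score lies between 0 and 1, and it reaches 1 only when the two images have identical class proportions.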