The repository includes my collected test dataset. For training data,
download the Caltech101 dataset and put the class folders inside the "101_ObjectCategories" folder.
In computer vision and image analysis, the bag-of-words model (BoW model, also known as bag-of-features) can be applied to image classification by treating image features as words. In text document classification, a bag of words is a sparse vector of word occurrence counts; that is, a sparse histogram over the vocabulary. The selected keywords that form the bag of words represent the vocabulary. In computer vision, a bag of visual words is a vector of occurrence counts over a vocabulary of local image features. Image category classification (categorization) is the process of assigning a category label to an image under test. Categories may contain images of just about anything, for example dogs, cats, trains, or boats. In this project, we use a bag-of-features approach for image category classification.
Figure 1. Bag-of-features models
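To make the "histogram over the vocabulary" idea concrete, here is a minimal Python sketch of the text version of the model. It is only an illustration and not part of the project's MATLAB code; the vocabulary and the sentence are made up for the example.

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    """Return the occurrence count of each vocabulary word in the document."""
    counts = Counter(document.lower().split())
    # Words outside the vocabulary are ignored; missing words count as 0.
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and document, chosen only for illustration.
vocabulary = ["cat", "dog", "train", "boat"]
histogram = bag_of_words("The dog chased the cat and the other dog", vocabulary)
print(histogram)  # [1, 2, 0, 0]
```

The image version works the same way, except that the "words" are clusters of local image features rather than strings.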
2- Database Preparations:
Downloading a suitable image data set is necessary for this project. The Caltech101 data set is one of the most widely cited and used data sets. It was collected by Fei-Fei Li, Marco Andreetto, and Marc'Aurelio Ranzato in September 2003. It can be downloaded from the link:
https://www.vision.caltech.edu/Image_Datasets/Caltech101/. The data set consists of pictures of objects belonging to 101 categories, with about 40 to 800 images per category. Most categories have about 50 images. The size of each image is roughly 300 x 200 pixels. Samples from the data set are shown in figure 2. For my code to work properly, put all the class folders inside the "101_ObjectCategories" folder.
Figure 2. Samples from the Caltech101 data set
For this project, we load all the data set images and then select only three categories (classes). The selected classes are 'headphone', 'soccer_ball', and 'watch'. These classes contain 42, 64, and 239 sample images, respectively.
The bag-of-features technique includes a fundamental quantization step based on clustering, as demonstrated in the coming sections. For this clustering to be effective, we make sure that the training set is balanced: each class should provide the same amount of training data. This matches the deployment scenario, where test data are expected to be balanced and no category/class is expected to occur more frequently than the others. So, the number of features is balanced across all image categories to improve clustering. Because the first class (‘headphone’) has the fewest samples (42), only the first 42 samples of the other two classes are used during training and validation; the remaining samples are ignored. So, each class provides 42 sample images.
The samples of each class are split into training and validation image sets. A configuration parameter sets the percentage of the data used for training; 30% was found to be a reasonable value. With this setting, 30% of the images from each class are used as training data and the remaining 70% as validation data. The validation data are used to validate the system and tune its parameters.
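The balancing and splitting steps described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the repository's MATLAB code: the file names are hypothetical, and rounding the training count to the nearest integer is an assumption (the actual code may round differently).

```python
def balance_and_split(samples_by_class, training_fraction):
    """Trim every class to the size of the smallest one, then split each class."""
    smallest = min(len(samples) for samples in samples_by_class.values())
    training, validation = {}, {}
    for label, samples in samples_by_class.items():
        kept = samples[:smallest]  # keep only the first `smallest` samples
        # Assumption: the training count is rounded to the nearest integer.
        n_train = round(training_fraction * len(kept))
        training[label] = kept[:n_train]
        validation[label] = kept[n_train:]
    return training, validation

# Class sizes from the text; the file names are hypothetical placeholders.
dataset = {
    "headphone": [f"headphone/image_{i:04d}.jpg" for i in range(42)],
    "soccer_ball": [f"soccer_ball/image_{i:04d}.jpg" for i in range(64)],
    "watch": [f"watch/image_{i:04d}.jpg" for i in range(239)],
}
training, validation = balance_and_split(dataset, 0.30)
for label in dataset:
    print(label, len(training[label]), len(validation[label]))
```

Under these assumptions, each class contributes 13 training images and 29 validation images, for 42 images per class in total.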
3- System overview
Figure 3 shows the modules of the system and the interaction/data flow between them. The first stage is feature extraction. In this project, SURF (Speeded Up Robust Features) features are extracted from all the training images. SURF is partly inspired by the scale-invariant feature transform (SIFT) descriptor. Then the strongest features from each class are chosen to form the bag of features containing the vocabulary of visual words. The strongest features are chosen based on a predefined fraction parameter passed to the system. These strongest features are then reduced by quantizing the feature space using K-means clustering. For each image in the training data, SURF features are extracted and then quantized to the nearest K-means cluster centers (the visual words). A histogram of visual word occurrences is then computed to represent that image. The histograms of the training data are used to train a classifier (a Support Vector Machine (SVM) in this project). That classifier is used during system deployment to classify the histograms obtained for test images.
Figure 3. System overview
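Two of the steps in this pipeline, keeping the strongest features and encoding an image as a histogram of visual words, can be sketched as follows. This is a toy Python illustration, not the project's MATLAB implementation: the function names are hypothetical, the descriptors are 2-D instead of real SURF descriptors, and the vocabulary is assumed to have already been produced by K-means.

```python
import math

def select_strongest(features, ratio):
    """Keep the given fraction of features with the highest strength.

    Each feature is a (strength, descriptor) pair.
    """
    ranked = sorted(features, key=lambda feature: feature[0], reverse=True)
    keep = max(1, round(ratio * len(ranked)))
    return ranked[:keep]

def nearest_word(descriptor, vocabulary):
    """Index of the visual word (cluster center) closest to the descriptor."""
    distances = [math.dist(descriptor, center) for center in vocabulary]
    return distances.index(min(distances))

def encode(descriptors, vocabulary):
    """Histogram of visual word occurrences for one image."""
    histogram = [0] * len(vocabulary)
    for descriptor in descriptors:
        histogram[nearest_word(descriptor, vocabulary)] += 1
    return histogram

# Toy data: three visual words, five features as (strength, descriptor).
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
features = [
    (0.9, (0.1, 0.0)),
    (0.8, (0.9, 1.2)),
    (0.7, (4.8, 5.1)),
    (0.2, (5.2, 4.9)),
    (0.6, (1.1, 0.8)),
]
strongest = select_strongest(features, 0.8)  # drops the weakest feature
print(encode([descriptor for _, descriptor in strongest], vocabulary))  # [1, 2, 1]
```

The resulting histogram, one per image, is what the classifier is trained on.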
4- Configuration Parameters:
- trainingDataSizePercent : The percentage of the data kept for training. The remaining data is used for validation.
- numberOfClusters : The number of clusters, which is the number of visual words in the bag of features. This is the k parameter of the K-means clustering algorithm.
- ratioOfStrongFeatures : The fraction of strongest features to use from each input class, given as a value in the range [0,1]. It corresponds to the StrongestFeatures name-value option.
- SVM_Kernel : Kernel function that the SVM classifier uses to map the training data into kernel space. The default kernel function is the dot product. The kernel function can be one of the following character vectors or a function handle:
'linear' — Linear kernel, meaning dot product.
'quadratic' — Quadratic kernel.
'polynomial' — Polynomial kernel (default order 3). Specify another order with the polyorder name-value pair.
'rbf' — Gaussian Radial Basis Function kernel with a default scaling factor, gamma, of 1. Specify another value for gamma with the SVM_RBF_Gamma parameter.
'mlp' — Multilayer Perceptron kernel with default scale [1 –1].
- SVM_C : The C parameter tells the SVM classifier how strongly you want to avoid misclassifying each training example. This is why higher values make the classifier more sensitive to outliers.
If C is a scalar, it is automatically rescaled by N/(2*N1) for the data points of group one and by N/(2*N2) for the data points of group two, where N1 is the number of elements in group one, N2 is the number of elements in group two, and N = N1 + N2. This rescaling accounts for unbalanced groups, that is, cases where N1 and N2 have very different values.
If C is an array, then each array element is taken as a box constraint for the data point with the same index.
- SVM_RBF_Gamma : The scaling factor used when the RBF kernel is selected.
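The rescaling rule described for a scalar SVM_C can be worked through numerically. The Python sketch below only reproduces the arithmetic N/(2*N1) and N/(2*N2) stated above; the function name is hypothetical and the group sizes are borrowed from the class counts mentioned earlier purely as an example.

```python
def rescale_box_constraint(c, n1, n2):
    """Return the per-group values of C after rescaling by N/(2*N1) and N/(2*N2)."""
    n = n1 + n2
    return c * n / (2 * n1), c * n / (2 * n2)

# Balanced groups: the rescaling leaves C unchanged.
balanced = rescale_box_constraint(1.0, 42, 42)
print(balanced)  # (1.0, 1.0)

# Unbalanced groups: the smaller group gets the larger penalty.
unbalanced = rescale_box_constraint(1.0, 42, 239)
print(unbalanced)
```

With balanced training data, as used in this project, the rescaling has no effect; it only matters when one group is much larger than the other.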
5- System Validation and Parameter Choices:
The system parameters are tuned through the validation process: the effect of changing each parameter on the validation accuracy is monitored. A vocabulary of 500 clusters with a strongest-features fraction of 0.8 was found to give the best results. Using a linear SVM kernel results in low validation accuracy, while the polynomial and RBF kernels give high performance. The training and validation data are not large enough to show which of the polynomial and RBF kernels is better.
The choice of the RBF scaling factor (for the RBF kernel) or the polynomial order (for the polynomial kernel) is interesting. With higher values, the classifier becomes too complex for the amount of training data. It easily overfits, leading to perfect classification accuracy on the training data but poor accuracy on data not seen during training (validation and test data). This observation is demonstrated in figure 4. With low values, the classifier's ability to model non-linearly separable data decreases and it behaves more like the weak linear kernel. This is why a moderate value is found to work best. Typical values are given in the accompanying MATLAB source code.
Figure 4. How overfitting affects prediction
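The effect of the RBF scaling factor can be illustrated with a small Python sketch. It assumes the common parametrization k(x, y) = exp(-gamma * ||x - y||^2); this is an assumption made for the example, and the project's MATLAB code may parametrize the kernel differently.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel, assuming the form exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

a, b = (0.0, 0.0), (1.0, 1.0)
for gamma in (0.01, 1.0, 100.0):
    # Similarity between two distinct points for increasing gamma.
    print(gamma, rbf_kernel(a, b, gamma))
```

Under this assumption, a very large gamma makes every point similar only to itself, so the classifier can memorize the training set, while a very small gamma makes all points look alike and the decision boundary becomes nearly linear.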
The C parameter tells the SVM classifier how strongly we want to avoid misclassifying each training example. This is why higher values make the classifier more sensitive to outliers. Figure 5 shows two examples with low and high C values. On the left, a low C gives a fairly large minimum margin (purple). However, this requires neglecting the blue circle outlier, which is not classified correctly. On the right, C is high. The classifier no longer neglects the outlier and therefore ends up with a much smaller margin.
Figure 5. Example of better lower C
Figure 6, however, shows that a lower C is not always better. A very small value of C causes the classifier to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points. With such a small C, the classifier often misclassifies points even when the training data are linearly separable. A moderate value of C is obtained via validation.
Figure 6. Example of better higher C
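The trade-off controlled by C can be checked numerically with the standard soft-margin objective, 0.5*w^2 + C * (sum of hinge losses). The Python sketch below compares two hand-picked 1-D decision functions on made-up data with one outlier. Both the data and the candidate hyperplanes are hypothetical, and neither candidate is claimed to be the exact optimum an SVM solver would return.

```python
def soft_margin_objective(w, b, points, c):
    """Soft-margin SVM objective for a 1-D decision function f(x) = w*x + b."""
    hinge = sum(max(0.0, 1.0 - y * (w * x + b)) for x, y in points)
    return 0.5 * w * w + c * hinge

# Made-up 1-D data: (x, label). The point at x = 0.5 is an outlier of class -1.
points = [(-3.0, -1), (-2.0, -1), (0.5, -1), (2.0, 1), (3.0, 1)]

wide = (0.5, 0.0)            # large margin, misclassifies the outlier
narrow = (4.0 / 3, -5.0 / 3)  # small margin, classifies every point correctly

for c in (0.1, 100.0):
    wide_cost = soft_margin_objective(*wide, points, c)
    narrow_cost = soft_margin_objective(*narrow, points, c)
    preferred = "wide" if wide_cost < narrow_cost else "narrow"
    print(f"C={c}: wide={wide_cost:.3f}, narrow={narrow_cost:.3f}, preferred={preferred}")
```

In this toy case, the low C favors the wide margin that ignores the outlier, and the high C favors the narrow margin that classifies it correctly, which mirrors the two behaviors described above.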
6- System Constraints:
It has been pointed out that the Caltech101 categories with more pictures are somewhat easier to classify (e.g. Airplanes (800+), Motorcycles (800+), Faces (400+)), while other categories have under 40 images and are more difficult.
Test images should have little or no clutter. The objects tend to be centered in each image. Most objects are presented in a stereotypical pose. The objects also tend to fill the image. The images should be correctly exposed; figure 7 shows what a correct exposure looks like. The resolution of test images should be about 300 x 200 pixels to match the training data resolution.
Figure 7. Correct exposure is important
Figure 8 shows an example of the resulting histogram for a training image. Table 1 shows the confusion matrix obtained when the training data are fed back into the system for testing. Table 2 shows the same matrix for the validation data. Both sets yield an average classification accuracy of 90.23%.
Figure 8. Example of visual word occurrences of a training data sample
Table 1. Training Data Confusion Matrix
Table 2. Validation Data Confusion Matrix
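For readers unfamiliar with how an average accuracy is read off a confusion matrix, here is a small Python sketch. It assumes the average is the mean of the per-class accuracies (the diagonal of the row-normalized matrix), and the counts are invented for illustration; they are not the project's results.

```python
def average_accuracy(confusion):
    """Mean of per-class accuracies; rows are true classes, columns are predictions."""
    per_class = [row[i] / sum(row) for i, row in enumerate(confusion)]
    return sum(per_class) / len(per_class)

# Illustrative counts only (NOT the results reported in this document).
# Row/column order: headphone, soccer_ball, watch.
confusion = [
    [7, 2, 1],
    [1, 8, 1],
    [0, 2, 8],
]
print(f"{average_accuracy(confusion):.4f}")  # 0.7667
```

When every class has the same number of samples, as in the balanced sets used here, this mean of per-class accuracies equals the overall fraction of correctly classified images.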
I captured 20 images: 5 for each of the selected classes ('headphone', 'soccer_ball', and 'watch') and 5 images that do not contain any of the three trained categories/classes (‘random_class’). The images are shown in figure 9.
Figure 9. My Collected Test Data
Table 3 shows the confusion matrix of the collected test data. An average accuracy of 87% is achieved on the unseen test data.
Table 3. Test Data Confusion Matrix
Figure 10 shows the classification results on the random data, which consist of images that do not contain any of the three trained categories/classes. Each image is assigned to whichever of the three trained classes it most resembles.
Figure 10. Classification results of images that do not contain any of the training three categories/classes
- L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. IEEE CVPR 2004, Workshop on Generative-Model Based Vision, 2004.
- Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool, "Speeded Up Robust Features", ETH Zurich, Katholieke Universiteit Leuven.