In machine learning, is it always true that you will achieve higher classification accuracy if you use more features? In other words, do more features always mean higher accuracy?
This is the question I am going to analyze and answer in this post. I hope you find it useful; if you have any questions, please let me know in the comments and I will be happy to answer. To do this, we will use the Skin Segmentation Dataset and a Naive Bayesian classifier. I implement the Bayesian classifier myself; no libraries or toolboxes are used.
The Skin Segmentation Dataset (Rajen Bhatt, Abhinav Dhall, UCI Machine Learning Repository) was collected by randomly sampling B, G, R values from face images of various age groups (young, middle, and old), race groups (white, black, and Asian), and genders, obtained from the FERET and PAL databases. The total learning sample size is 245057, of which 50859 are skin samples and 194198 are non-skin samples. The dataset has dimensions 245057 x 4, where the first three columns are the B, G, R values (features x1, x2, and x3) and the fourth column is the class label (decision variable y).
The first 90% of samples in each class are chosen to form the training data, while the remaining 10% of samples in each class form the test data.
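As a rough sketch of this step (not the original implementation), the data can be loaded and split per class as follows; I assume the UCI file Skin_NonSkin.txt in the working directory, with whitespace-separated columns B, G, R, label, and labels 1 = skin, 2 = non-skin.

import numpy as np

# Load the UCI skin segmentation data: columns are B, G, R, label (1 = skin, 2 = non-skin).
data = np.loadtxt("Skin_NonSkin.txt")
X, y = data[:, :3], data[:, 3].astype(int)

def split_per_class(X, y, train_fraction=0.9):
    """Take the first 90% of each class as training data, the rest as test data."""
    train_idx, test_idx = [], []
    for label in np.unique(y):
        idx = np.where(y == label)[0]          # samples of this class, in file order
        cut = int(train_fraction * len(idx))   # first 90% go to training
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return (X[train_idx], y[train_idx]), (X[test_idx], y[test_idx])

(X_train, y_train), (X_test, y_test) = split_per_class(X, y)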
Data Set Characteristics: Univariate
Number of Instances: 245057
Area: Computer
Attribute Characteristics: Real
Number of Attributes: 4
Date Donated: 2012-07-17
Associated Tasks: Classification
Missing Values: N/A
Figure 1 shows plots of the two classes' likelihoods, the fitted normal distributions (not used by the classifier and plotted for visualization only), and the posterior probabilities built from the training data. The plots are for the best single feature, which I found to be the third one (the red color feature). The prior for the first class w1 (skin class) is chosen to be 0.4 and for the second class w2 (non-skin class) to be 0.6. These values reflect the fact that skin colors are really a small subset of all possible colors, and this is why they achieve a better Minimum Correct Classification Rate (MCCR).
Figure 1. Plots of the two classes’ likelihood and posterior probabilities for the best single feature (the red color feature)
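To make the construction behind Figure 1 concrete, here is a minimal sketch of how the histogram likelihoods and posteriors for a single feature can be estimated. The bin width of 10 and the priors 0.4/0.6 follow the text; the function and variable names are my own illustration, not the author's code.

BIN_WIDTH = 10
BINS = np.arange(0, 256 + BIN_WIDTH, BIN_WIDTH)   # color values range over 0..255

def histogram_likelihood(values):
    """Estimate P(x | class) for one feature as a normalized histogram."""
    counts, _ = np.histogram(values, bins=BINS)
    return counts / counts.sum()

# Likelihoods for the red feature (column index 2) of each class.
p_x_given_skin    = histogram_likelihood(X_train[y_train == 1, 2])
p_x_given_nonskin = histogram_likelihood(X_train[y_train == 2, 2])

# Priors chosen in the post: skin colors are a small subset of all colors.
P_SKIN, P_NONSKIN = 0.4, 0.6

# Unnormalized posteriors per bin; normalizing by the evidence gives P(class | x).
evidence = p_x_given_skin * P_SKIN + p_x_given_nonskin * P_NONSKIN
post_skin = np.divide(p_x_given_skin * P_SKIN, evidence,
                      out=np.zeros_like(evidence), where=evidence > 0)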
Next, I try pairs of features to identify the best couple of features. Figure 2 shows plots of the likelihoods and posteriors for the best couple of features, which turns out to be the blue and red color features. The code is available, so you can run it yourself and reach the same result.
Figure 2. Plots of the two classes’ likelihood and posterior probabilities for the best couple of features (blue and red colors features)
The normal Bayes rule is applied to each test sample x, and x is assigned to class w1 (skin) if P(x|w1)P(w1) > P(x|w2)P(w2), and to class w2 (non-skin) otherwise. The priors are chosen as P(w1) = 0.4 and P(w2) = 0.6 to reflect the fact that skin colors are really a small subset of all possible colors, which is why these values achieve a better Minimum Correct Classification Rate (MCCR). For a fair comparison across the three cases of feature-space dimensionality, the histogram bin width is kept fixed in all cases; a color feature takes values from 0 to 255, and the bin width is chosen to be 10 (set empirically based on the dataset size).
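The decision rule and the evaluation can be sketched as below, multiplying per-feature histogram likelihoods under the naive (feature-independence) assumption. Note that I am reading MCCR as the minimum of the two per-class correct classification rates; that interpretation, and all function names, are my own assumptions rather than the author's code.

def class_likelihood(x, feature_hists):
    """Naive Bayes: multiply the per-feature histogram likelihoods for one sample."""
    p = 1.0
    for f, hist in feature_hists.items():
        b = min(int(x[f] // BIN_WIDTH), len(hist) - 1)   # histogram bin of this feature value
        p *= hist[b]
    return p

def classify(x, skin_hists, nonskin_hists, p_skin=0.4, p_nonskin=0.6):
    """Assign x to skin (1) if P(x|w1)P(w1) > P(x|w2)P(w2), else non-skin (2)."""
    skin_score = class_likelihood(x, skin_hists) * p_skin
    nonskin_score = class_likelihood(x, nonskin_hists) * p_nonskin
    return 1 if skin_score > nonskin_score else 2

def mccr(X_test, y_test, skin_hists, nonskin_hists):
    """Minimum over the two classes of the per-class correct classification rate."""
    preds = np.array([classify(x, skin_hists, nonskin_hists) for x in X_test])
    rates = [np.mean(preds[y_test == c] == c) for c in (1, 2)]
    return min(rates)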
For the best single feature, the best couple of features, and all three features combined, the resulting MCCRs are 0.8439, 0.9581, and 0.9128, respectively, as shown in Figure 3.
Figure 3. MCCR for different number of features
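The comparison behind Figure 3 can be reproduced by evaluating every feature subset with the helpers sketched above; this enumeration loop is my own illustration of the search, not the original code.

from itertools import combinations

def build_hists(X_train, y_train, features):
    """Per-class, per-feature normalized histograms for a given feature subset."""
    hists = {}
    for label in (1, 2):
        Xc = X_train[y_train == label]
        hists[label] = {f: histogram_likelihood(Xc[:, f]) for f in features}
    return hists

# Evaluate single features, all pairs, and the full triple (features 0, 1, 2 = B, G, R).
for k in (1, 2, 3):
    for features in combinations(range(3), k):
        h = build_hists(X_train, y_train, features)
        print(features, round(mccr(X_test, y_test, h[1], h[2]), 4))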
As stated earlier, the red feature proves to be the best single feature. My experiments also indicate that combining any two of the three color features achieves a better MCCR than the red color feature alone. The gain is largest when the red and blue color features are combined (the best couple of features).
If the green feature is added to the best couple of features to form the combined three-feature set, the MCCR is still better than in the best single feature case, but it drops slightly compared to the best couple of features. In general, higher accuracy comes not from using more features but from using the right features. Every feature represents some aspects of the data well and other aspects noisily. If the useful information a new feature carries is already captured by the existing features (more so than its noise), adding it mostly introduces noise into the overall representation and reduces the MCCR. This is what happens when the green color feature is added to the best couple of features.
Note: On the other hand, the more features you use (especially when the number of training samples is small), the more likely you are to over-fit the training data, and hence the lower your classification accuracy on test data. This is one reason why PCA is useful. We will discuss it in detail in a coming post.