In this article, I describe how to use regression to tackle a classification problem. Regression and classification are fundamental topics in machine learning. As a reminder, in regression the output variable takes continuous values, while in classification the output variable takes class labels. To demonstrate and prove the concept, I wrote configurable MATLAB code that you can download from the link below (no MATLAB toolboxes are used):
In the link above, I provide source code for Least Squares Regression along with two data sets to run the code on. Each set consists of sample data points representing two classes. One of the sets represents a linearly separable classification problem, and the other a non-linearly separable one. You can easily configure the code to train a model on either of the two sets, or on any custom data set you have created, by setting the variable dataFileName.
To use Least Squares Regression to solve a classification problem, a simple trick is used. The data points of the first and second classes are extended by adding an extra dimension. This produces an augmented cloud of points in (n+1)-dimensional space, where n is the dimensionality of the original data space. In that extra dimension, the data points belonging to the first and second classes take the values -1 and +1 respectively.
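The augmentation step can be sketched in a few lines. This is a minimal Python/NumPy illustration, not the MATLAB code from the download; the function name `augment_with_targets` and the toy points are assumptions made for the example.

```python
import numpy as np

def augment_with_targets(class1_points, class2_points):
    """Stack two classes and append the extra dimension (-1 / +1).

    Hypothetical helper for illustration; not part of the MATLAB code.
    """
    class1_points = np.asarray(class1_points, dtype=float)
    class2_points = np.asarray(class2_points, dtype=float)
    points = np.vstack([class1_points, class2_points])
    targets = np.concatenate([
        -np.ones(len(class1_points)),  # first class  -> -1
        np.ones(len(class2_points)),   # second class -> +1
    ])
    # Each row is (x_1, ..., x_n, target): a point in (n+1)-dimensional space.
    return np.column_stack([points, targets])

# Toy 2D data (made up): two points in class 1, one point in class 2.
augmented = augment_with_targets([[0.0, 0.0], [1.0, 0.5]], [[3.0, 3.0]])
print(augmented.shape)
```

With n = 2, every augmented row has three entries, the last one being the class value.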
Then, samples of the augmented data (with the extra dimension) are fitted using Least Squares Regression. In my code, the function to be fitted is chosen to be a polynomial. The regression objective is to estimate the parameters of that polynomial such that it best fits the training data in a least-squares sense. You can easily change the order of the polynomial by setting the variable polynomial_order. If it is set to 1, then for the 2D data points I use as examples with my code, the fitted polynomial represents a plane in 3D. If it is set to more than 1, the surface can curve and hence fit more complex data.
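As a rough sketch of this fitting step for 2D points, the snippet below builds a polynomial design matrix and solves for the coefficients with `numpy.linalg.lstsq`. The choice of basis (all monomials x^i * y^j with i + j <= order) and the names `polynomial_features` and `fit_least_squares` are assumptions for illustration; the MATLAB code may construct its polynomial differently.

```python
import numpy as np

def polynomial_features(points, order):
    """Monomials x^i * y^j with i + j <= order for 2D points (assumed basis)."""
    points = np.asarray(points, dtype=float)
    x, y = points[:, 0], points[:, 1]
    columns = [x**i * y**j
               for i in range(order + 1)
               for j in range(order + 1 - i)]
    return np.column_stack(columns)

def fit_least_squares(points, targets, order):
    """Return the coefficients minimizing the sum of squared residuals."""
    design = polynomial_features(points, order)
    targets = np.asarray(targets, dtype=float)
    coeffs, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return coeffs

# Sanity check on made-up data lying exactly on the plane z = 1 + 2x - y.
# For order 1 the columns are [1, y, x], so the coefficients come back
# in that order.
points = [[0, 0], [1, 0], [0, 1], [1, 1]]
targets = [1, 3, 0, 2]
coeffs = fit_least_squares(points, targets, order=1)
print(coeffs)
```

For classification, `targets` would hold the -1 and +1 values from the extra dimension, and a higher `order` would add curved terms such as x*y and x^2.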
To achieve classification, the decision boundary is simply the intersection between the fitted polynomial surface and the surface where the extra dimension is constant at a value midway between -1 and +1. The -1 and +1 here are the values we previously assigned in the extra dimension to each class; if we assign different values, the midway value changes accordingly. Figure 1 shows the decision boundary for the linear data from different points of view, and Figure 2 shows the same for the wave-like data, where misclassified samples are circled in red. In this 2D case, the decision boundary is the intersection of the fitted polynomial surface and the horizontal plane z=0 (z is the extra dimension here).
Figure 1. Least squares models to classify the linear data, from different points of view.
Figure 2. Least squares models to classify the wave-like data, from different points of view. Misclassified samples are circled in red.
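The decision rule described above amounts to checking on which side of the midway value the fitted surface lies. Below is a minimal sketch for a first-order fit (a plane) on made-up 2D data; the function name `classify` and the toy points are assumptions, not taken from the MATLAB code.

```python
import numpy as np

def classify(fitted_values, class1_value=-1.0, class2_value=1.0):
    """Label each sample 1 or 2 by the class value its fitted z is closer to.

    The boundary is where the fitted surface equals the midpoint
    (class1_value + class2_value) / 2, which is z = 0 for -1 and +1.
    Ties on the boundary are assigned to class 1 here (arbitrary choice).
    """
    fitted_values = np.asarray(fitted_values, dtype=float)
    closer_to_class2 = (np.abs(fitted_values - class2_value)
                        < np.abs(fitted_values - class1_value))
    return np.where(closer_to_class2, 2, 1)

# Toy, well-separated 2D data (made up for illustration).
points = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]], dtype=float)
targets = np.array([-1, -1, -1, 1, 1, 1], dtype=float)

# First-order fit: z = w0 + w1*x + w2*y, a plane in 3D.
design = np.column_stack([np.ones(len(points)), points])
weights, *_ = np.linalg.lstsq(design, targets, rcond=None)

labels = classify(design @ weights)
print(labels)  # [1 1 1 2 2 2]
```

If other class values were used in the extra dimension, passing them as `class1_value` and `class2_value` moves the threshold to their midpoint.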
For classification accuracy, I use the Minimum Correct Classification Rate (MCCR). MCCR is defined as the minimum of CCR1 and CCR2, where CCRn is the number of correctly classified test points in class n divided by the total number of test points in class n. The MCCR for the linear data set is 1 using a polynomial of order 3. For the wave-like data, the MCCR is 0.94. The reason for not achieving a perfect MCCR of 1 on the wave-like data is that classification with Least Squares Regression is sensitive to outliers: it tries to fit a function such that all the training points give small squared errors.
Figure 3 is taken from Chapter 4 of "Pattern Recognition and Machine Learning" by Bishop. Both images in the figure show, in purple, the classification decision boundary obtained from Least Squares Regression as detailed above. The decision boundary is good until some outlying data points are added to the blue class, as in the image on the right. The resulting classifier penalizes these outliers even though they are 'too correct' data points, that is, they lie far from the boundary on the correct side. The decision boundary shown in green is obtained by Logistic Regression. Its advantage is that it is robust to such outliers and does not penalize the 'too correct' data points.
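The MCCR metric defined above is straightforward to compute. Here is a minimal plain-Python sketch; the function name `mccr`, the class labels 1 and 2, and the example label lists are assumptions for illustration.

```python
def mccr(true_labels, predicted_labels, classes=(1, 2)):
    """Minimum Correct Classification Rate: min over classes of CCR_n.

    CCR_n = (# test points of class n predicted as n) / (# test points of class n).
    """
    rates = []
    for c in classes:
        total = sum(1 for t in true_labels if t == c)
        if total == 0:
            raise ValueError(f"no test points for class {c}")
        correct = sum(1 for t, p in zip(true_labels, predicted_labels)
                      if t == c and p == c)
        rates.append(correct / total)
    return min(rates)

# Made-up labels: one of four class-1 points is misclassified,
# both class-2 points are correct, so CCR1 = 0.75 and CCR2 = 1.0.
true_labels = [1, 1, 1, 1, 2, 2]
predicted_labels = [1, 1, 1, 2, 2, 2]
print(mccr(true_labels, predicted_labels))  # 0.75
```

Taking the minimum rather than the overall accuracy means a classifier cannot score well by favoring the larger class.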