In #7, I talked about choosing the adjusted R-squared as the error metric. In this article, I am going to summarize the features of the K nearest neighbour (KNN) model, the simplest model to use for predicting house prices.
The reason I want to use KNN is just that it is simple. I hope the model can generate usable results for the Minimum Viable Product (MVP).
The purpose of this post is to summarize the features of KNN with reference to different books and resources. By referring to multiple sources, we are more likely to get a holistic view of the KNN model. I am not an expert in the machine learning field, but I hope that by sharing my knowledge, I can learn from you guys.
Here is a list of resources I have referred to:
- Machine Learning: A Probabilistic Perspective by Kevin P. Murphy
- Pattern Recognition and Machine Learning by Christopher M. Bishop
- The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
- CS4780/CS5780: Machine Learning for Intelligent Systems from Cornell by Kilian Weinberger
In this post, I am going to focus on classification first. In the next post, I will talk about how the KNN model can be used in regression.
A Gentle Introduction
The KNN model classifies a data point by taking a majority vote among its k nearest neighbours.
The above image shows a 5 nearest neighbours case (k=5). Since the majority of the points in the left circle are red, the unknown test sample will be classified as red. The same logic applies to the right circle. Note that the points do not have to be two-dimensional. They can be three-dimensional, plotted on 3 axes, or even higher-dimensional, which is beyond what humans can visualise.
Expressed in mathematical terms,
$$p(y = c \mid x, D, K) = \frac{1}{K} \sum_{i \in N_K(x, D)} \mathbb{I}(y_i = c)$$
I know it is scary, but let me explain. Our training set $D$ contains features and labels. Features are denoted as $x$ and labels are denoted as $y$. So the left-hand side is the probability of the label $y$ being a particular class $c$, given the features $x$, the training set $D$, and $K$, which is the number of nearest neighbours.
The right-hand side is the number of points of class $c$ among the K nearest neighbours, divided by the total number of nearest neighbours $K$. The indicator function $\mathbb{I}(y_i = c)$ is 1 if $y_i = c$ and 0 otherwise. Summing over all the K nearest points, $i \in N_K(x, D)$, just means counting the number of points of class $c$.
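The majority vote can be sketched in a few lines of Python. This is only a minimal sketch, not a production implementation: the function names, the toy data and the choice of Euclidean distance are my own assumptions for illustration.

```python
from collections import Counter
import math


def knn_class_probabilities(train_x, train_y, query, k):
    """Return {class: share of the k nearest neighbours that belong to it}."""
    # Distance from the query to every training point (Euclidean, an assumption).
    distances = [(math.dist(x, query), y) for x, y in zip(train_x, train_y)]
    # Keep the labels of the k closest points.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    nearest_labels = [label for _, label in nearest]
    # Counting labels plays the role of the indicator sum; dividing by k gives the probability.
    counts = Counter(nearest_labels)
    return {label: count / k for label, count in counts.items()}


def knn_predict(train_x, train_y, query, k):
    """Predict the class with the largest share of the vote."""
    probabilities = knn_class_probabilities(train_x, train_y, query, k)
    return max(probabilities, key=probabilities.get)


# Toy data, made up for illustration only.
train_x = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_y = ["red", "red", "red", "blue", "blue", "blue"]

probs = knn_class_probabilities(train_x, train_y, query=(2.5, 2.5), k=5)
prediction = knn_predict(train_x, train_y, query=(2.5, 2.5), k=5)
print(probs, prediction)
```

With k=5, three of the five nearest points are red and two are blue, so the vote goes to red with a probability of 3/5.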
The general idea is simple enough (the math part may be a bit challenging if you are not familiar with mathematical symbols). But what are the characteristics of the KNN model, and what should we watch out for when using it?
KNN is a non-parametric model (sometimes called memory-based learning or instance-based learning). It means the number of parameters grows with the amount of data. To know whether a test point is of a particular class, I need to remember all the training points. The more data you have, the more distance calculations KNN has to do for each prediction, and the slower it gets.
We keep saying nearest neighbour. What do we mean by nearest? How do we measure what is near? The distance metric is key to the effectiveness of KNN. The most commonly used distance metric is the Euclidean distance (remember the Pythagorean theorem?). But there are many other distance metrics.
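To make "many other ways of measuring distance" concrete, here is a small sketch of three common choices. The Manhattan and Chebyshev distances are my own examples and do not come from the resources above; the function names are hypothetical.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two points of equal length."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)


def chebyshev_distance(a, b):
    """Largest absolute difference along any single axis."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))


a, b = (0.0, 0.0), (3.0, 4.0)
euclidean = minkowski_distance(a, b, p=2)  # the Pythagoras case: straight-line distance
manhattan = minkowski_distance(a, b, p=1)  # sum of the moves along each axis
chebyshev = chebyshev_distance(a, b)
print(euclidean, manhattan, chebyshev)
```

The same two points are 5, 7 or 4 units apart depending on the choice, so which neighbours count as "nearest" can change with it.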
In high-dimensional spaces, the concept of nearest neighbour may not be valid anymore. Imagine everyone living on the ground floor (2D): everyone is near each other. If the building suddenly becomes 30 floors tall (3D), there is much more space between people.
The above figure illustrates that the higher the dimension, the fewer data points fall into the red region. Therefore, in high dimensions, KNN is not effective, since the concept of nearest may no longer be meaningful.
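A rough way to put numbers on this, assuming the data is spread uniformly over a unit hypercube (my simplifying assumption, not necessarily the setting of the figure): a sub-cube with a fixed edge length covers a rapidly shrinking share of the volume, and the edge needed to cover a fixed share approaches the full width of the space.

```python
def fraction_inside(edge, dim):
    """Share of a unit hypercube's volume covered by a sub-cube with the given edge."""
    return edge ** dim


def edge_needed(fraction, dim):
    """Edge length a sub-cube needs in order to cover the given share of the volume."""
    return fraction ** (1 / dim)


for dim in (1, 2, 3, 10, 100):
    # Column 2: share of points expected inside a sub-cube with edge 0.5.
    # Column 3: edge length needed to capture 10% of the points.
    print(dim, fraction_inside(0.5, dim), round(edge_needed(0.1, dim), 3))
```

In 10 dimensions a sub-cube with edge 0.5 holds less than 0.1% of the volume, and capturing 10% of the points needs an edge of roughly 0.79, so the "neighbourhood" is no longer local.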
The concept of nearest neighbour also suffers from a scaling problem. If the features are on different scales, for example one in metres and one in centimetres, distances along the two features are not comparable. KNN does not know the units, so it treats the 1 in 1cm and the 1 in 1m as the same quantity, even though 1cm is much shorter than 1m. To solve this, the features are usually standardized first, and there are different ways to achieve this.
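One common way to standardize is the z-score: subtract the mean and divide by the standard deviation of each feature. The sketch below uses made-up heights and a hypothetical `standardize` helper, and it assumes the feature is not constant (otherwise the standard deviation is zero).

```python
import statistics


def standardize(column):
    """Rescale a feature column to mean 0 and standard deviation 1 (z-score)."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population standard deviation; assumed non-zero
    return [(value - mean) / std for value in column]


# The same heights expressed in two different units (made-up numbers).
heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]

z_m = standardize(heights_m)
z_cm = standardize(heights_cm)
print(z_m)
print(z_cm)
```

After standardizing, the two columns are numerically the same (up to rounding), so the unit no longer affects which neighbours are nearest.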
A More Probabilistic View
This is a more advanced section.
Suppose that observations are being drawn from some unknown probability density $p(x)$ (the unknown we are going to find out) in some D-dimensional space. Let us consider some small region $R$ containing $x$ (the red box in the above figure). The probability mass in this region is given by:
$$P = \int_R p(x)\,dx$$
(In other words, this is the probability of falling within $R$.)
Suppose we have collected a data set containing $N$ observations drawn from $p(x)$. Since each point has a probability $P$ of falling within $R$, the total number of points that lie inside $R$, denoted as $K$, will be distributed according to the binomial distribution (each point is either within $R$ or not):
$$\text{Bin}(K \mid N, P) = \frac{N!}{K!\,(N-K)!} P^K (1-P)^{N-K}$$
Here, we assume some knowledge of the binomial distribution, including the derivation of the formula and its mean and variance. If you need a refresher, please refer to this video series.
Knowing that the mean of the distribution is $NP$ and the variance is $NP(1-P)$, we see that the mean fraction of points falling inside the region is
$$E[K/N] = P$$
and the variance is
$$\text{var}[K/N] = \frac{P(1-P)}{N}$$
For large $N$, the variance shrinks towards zero, so the real fraction of points falling inside the region is just the expected value. Therefore:
$$K \simeq NP$$
If, however, we also assume that the region is sufficiently small that the probability density is roughly constant over the region (if we zoom in on a curve, it looks like a straight line), then we have:
$$P \simeq p(x)V$$
where $V$ is the volume of $R$. Combining the two equations and solving for $p(x)$, we have
$$p(x) = \frac{K}{NV}$$
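Note that the two assumptions, namely a sufficiently large $N$ and a sufficiently small $R$, pull in opposite directions: the smaller the region, the more data we need before enough points fall inside it.
Also note that there are two variables, $K$ and $V$, in the equation. Fixing $K$ and letting the data determine $V$, we have the KNN model; fixing $V$ and letting the data determine $K$, we have the kernel approach.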
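To see the estimate $p(x) = K/(NV)$ with $K$ fixed in action, here is a small one-dimensional sketch. I draw samples from a standard normal distribution only because its true density is known, so the estimate can be checked; the sample size, the value of `k` and the function name are arbitrary assumptions.

```python
import math
import random


def knn_density_1d(samples, x, k):
    """Estimate p(x) as K / (N * V), growing the interval around x until it holds k samples."""
    distances = sorted(abs(s - x) for s in samples)
    radius = distances[k - 1]  # distance to the k-th nearest sample
    volume = 2.0 * radius      # in 1-D the "sphere" is the interval [x - r, x + r]
    return k / (len(samples) * volume)


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.gauss(0.0, 1.0) for _ in range(20_000)]

estimate = knn_density_1d(samples, x=0.0, k=200)
true_density = 1.0 / math.sqrt(2.0 * math.pi)  # standard normal density at 0, about 0.399
print(estimate, true_density)
```

The estimate should land close to the true value but not exactly on it. A smaller `k` makes it noisier, while a larger `k` widens the interval and smooths the density.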
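We now allow the radius of the sphere around $x$ to grow (and hence its volume $V$) until it contains precisely $K$ data points. Using Bayes' theorem, we have
$$p(y = c \mid x) = \frac{p(x \mid y = c)\,p(y = c)}{p(x)}$$
$c$ is the class label. Let $N_c$ be the number of training points in class $c$. The sum of the points in each class is equal to the total number of points:
$$\sum_c N_c = N$$
We have already derived $p(x) = \frac{K}{NV}$. And
$$p(y = c) = \frac{N_c}{N}, \qquad p(x \mid y = c) = \frac{K_c}{N_c V}$$
where $K_c$ is the number of points of class $c$ among the $K$ points inside the sphere. The last formula is just the conditioned version of $p(x) = \frac{K}{NV}$.
Combining the information above, we have
$$p(y = c \mid x) = \frac{K_c}{N_c V} \cdot \frac{N_c}{N} \cdot \frac{NV}{K} = \frac{K_c}{K}$$
We get back our first formula at the very beginning!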
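As a quick sanity check of the algebra, the three pieces $p(x \mid y = c) = K_c/(N_c V)$, $p(y = c) = N_c/N$ and $p(x) = K/(NV)$ can be plugged into Bayes' theorem with made-up counts. The numbers and the function name below are purely illustrative.

```python
from fractions import Fraction


def posterior_from_counts(k_c, n_c, k, n, volume):
    """Combine the three density estimates with Bayes' theorem, using exact fractions."""
    likelihood = Fraction(k_c) / (n_c * volume)  # p(x | y = c) = K_c / (N_c V)
    prior = Fraction(n_c, n)                     # p(y = c) = N_c / N
    evidence = Fraction(k) / (n * volume)        # p(x) = K / (N V)
    return likelihood * prior / evidence


# Made-up counts: 100 points in total, 40 of class c, 5 neighbours of which 3 are of class c.
posterior = posterior_from_counts(k_c=3, n_c=40, k=5, n=100, volume=Fraction(1, 8))
print(posterior)
```

The volume, $N$ and $N_c$ all cancel, so the result depends only on $K_c$ and $K$.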
Lower Bound and Upper Bound of KNN Error
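Assume that there are only two classes and that we are using 1-NN (KNN with $k = 1$).
Consider that the number of points, $n$, goes to infinity. Then the distance between our test point $x_t$ and its nearest neighbour $x_{NN}$ tends to zero, and therefore $x_{NN}$ is just like $x_t$.
What is the probability that the label of $x_{NN}$ is not the label of $x_t$?
Let $y^*$ be the most probable label at $x_t$ (the Bayes optimal prediction). There are only two cases where the two labels are different, and the probability is
$$\epsilon_{NN} = P(y^* \mid x_t)\,(1 - P(y^* \mid x_{NN})) + P(y^* \mid x_{NN})\,(1 - P(y^* \mid x_t))$$
Since $x_{NN}$ and $x_t$ coincide in the limit, their labels are drawn from the same distribution, so $P(y^* \mid x_{NN}) = P(y^* \mid x_t)$. And $P(y^* \mid x_t)$ and $P(y^* \mid x_{NN})$ are at most 1. We have:
$$\epsilon_{NN} \le (1 - P(y^* \mid x_{NN})) + (1 - P(y^* \mid x_t)) = 2\,(1 - P(y^* \mid x_t)) = 2\,\epsilon_{BayesOpt}$$
In other words, the error of 1-NN is less than or equal to two times the Bayes optimal error (for more detail on the Bayes optimal error).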
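The bound can be checked numerically for the two-class case. Writing $p$ for the probability of the most likely label at the test point (so $p$ is at least 0.5), the limiting 1-NN error is $2p(1-p)$ and the Bayes optimal error is $1-p$. The sketch below just evaluates both on a grid of values; the function names are my own.

```python
def one_nn_asymptotic_error(p):
    """Chance that two labels drawn independently with P(most likely label) = p disagree."""
    return p * (1 - p) + p * (1 - p)


def bayes_optimal_error(p):
    """Error of always predicting the most likely label, whose probability is p."""
    return 1 - p


grid = [0.5 + i / 100 for i in range(51)]  # p from 0.5 to 1.0
for p in grid:
    eps_nn = one_nn_asymptotic_error(p)
    eps_bayes = bayes_optimal_error(p)
    # The 1-NN error is sandwiched between the Bayes error and twice the Bayes error.
    assert eps_bayes - 1e-12 <= eps_nn <= 2 * eps_bayes + 1e-12

print(one_nn_asymptotic_error(0.9), bayes_optimal_error(0.9))
```

The two errors meet at $p = 0.5$ and $p = 1$, and the factor of two is approached as $p$ gets close to 1.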
What is the upper bound then? It is the error of always predicting the most common label in the training set. In KNN, that is when $k = n$.
Next time, we will look at how KNN can be used in Regression problems.