In #7, I have talked about choosing the adjusted R square as the error metric. In this article, I am going to summarize the features of K nearest neighbour model, the simplest model to use in predicting house price.

The reason I want to use KNN is just that it is simple. I hope the model can generate usable results for the Minimum Viable Product (MVP).

The purpose of this post is to summarize the features of KNN with reference to different books and resources. By referencing to multiple sources, we are more likely to get a holistic view of the KNN model. I am not an expert in Machine Learning field, but I hope by sharing my knowledge, I can learn from you guys.

Here is a list of resources I have referenced to:

In this post, I am going to focus on classification first. In the next post, I will talk about how KNN model can be used in regression.

A Gentle Introduction

KNN model classifies a data point by taking a majority vote among the k nearest neighbours.

An example of kNN classification task with k = 5 | Download Scientific  Diagram

The above image shows a 5 nearest neighbours case (k=5). Since the majority of points in the left circle is red, the unknown test sample will be classified as red. Similarly, the same happens in the right circle. Noted that the points are not necessarily two dimensions. They can be three dimensions, plotted with 3 axes or even higher which is beyond human visualisation capability.

Express in mathematical terms,

\[P(y=c|x,D,K) = \frac{1}{K}\sum_{i \in N_K(x,D)}\mathbb I(y_i = c)\]

I know it is scary, but let me explain. Our training set D contains features and labels. Features are denoted as x and labels are denoted as y. So, on the left-hand side, it is the probability of the label y being a particular class c given the features x, D and K which is the number of the nearest neighbour.

On the right hand side, it is the total number of points of class c in the k nearest neighbour over the total number of k nearest neighbour K. The indicator function \mathbb{I} means if y_i = c, it is 1 and 0 otherwise. Summing over all the K nearest points \sum_{i \in N_K(x,D)}, just means counting the number of points of class c.

The general idea is simple enough (the math part may be a bit challenging if you are not familiar with mathematical symbols). But what are the characteristics and points to note while using KNN model?

KNN is a non-parametric model (or sometimes called memory-based learning/ instance-based learning). It means the number of parameters grows with the amount of data. To know whether a test point is of a particular class, I need to remember all training points. The more data you have, the more calculations on KNN, the slower the performance.

We keep saying nearest neighbour. What do you mean by nearest? How do you measure what is near? The distance matrix is key to the effectiveness of KNN. The commonly used distance matrix is Euclidean Distance (Remember the Pythagoras Theorem?). But there are many more distance matrix.

In high dimensional spaces, the concept of nearest neighbour may not be valid anymore. Imagine all the people are living on the ground floor (2D), everyone is near. If suddenly, the building becomes 30/F (3D), there are more spaces between each other.

Curse of dimensionality part 1: Value at Risk

The above figure illustrated that the higher the dimension, the fewer the data points falling into the red region. Therefore, in high dimension, KNN is not effective since the concept of nearest may not be correct.

The concept of nearest neighbour also suffers from scaling problem. If features are on a different scale, for example, one in meter, one in centimetre, the concept of nearest neighbour is not comparable between the two. The 1s in 1cm and 1m are treated as the same quantity in KNN since KNN does not know the units. But in fact, 1cm is much shorter than 1m. To solve this, usually, the features are first standardized and there are different ways to achieve this.

A More Probabilistic View

This is a more advance section.

Suppose that observations are being drawn from some unknown probability density p(x) (the unknown we are going to find out) in some D-dimensional space. Let us consider some small region R containing x (the red box in the above figure). The probability mass in this region is given by:

\[P = \int_{R}p(x)dx\]

(It other words, this is the probability of falling within R)

Suppose we have collected a data set containing N observations drawn from p(x). Since each point has a probability P falling within R, the total number of points that lie inside R, denoted as K will be distributed according to the binomial distribution (either within or not):

\[Bin(K|N,P) = \frac{N!}{K!(N-K)!}P^K(1-p)^(N-K)\]

Here, we assumed some knowledge regarding binomial distribution, including the derivation of the formula and its mean and variance. If you need a refreshment, please refer to this video series.

Knowing that the mean of the distribution is NP and the variance is NP(1-P). We see that the mean fraction of points falling inside the region is

\[\mathbb{E}[K/N] = P\]

and the variance is

\[var[K/N] = \frac{P(1-P)}{N}\]

For large N, the real fraction of points falling inside the region is just the expected value. Therefore:

\[K \simeq NP\]

If, however, we also assume that the region R is sufficiently small that the probability density P(x) is roughly constant over the region (If we zoom in to a curve, it looks like a straight line), then we have:

\[P \simeq p(x)V\]

V is the volume of R. Combining the two equations and solving for P(x), we have

\[p(x) = \frac{K}{NV}\]

Noted that the two assumptions, namely sufficiently large N and sufficiently small R, are contradicting.

Also noted that there are two variables, K, V, in the equation. Fixing K, we will have the KNN model; Fixing V, we have the kernel approach.

We now allow the radius of the sphere to grow (V) until it contains precisely K data points. Using the Bayes Theorem, we have

\[p(y=c|x) = \frac{p(x|y=c)p(y=c)}{p(x)} \]

c is the class label. Therefore, the sum of all points in each class is equal to the total number of points

\[\sum_{c} N_{c} = N\]

We already have p(x) derived. And

\[p(y=c) = \frac{N_c}{N}\]
\[p(x|y=c) = \frac{k_c}{N_cV}\]

The last formula is just the conditioned version of p(x).

Combining the information above, we have

\[p(y=c|x) = \frac{k_c}{K} = \frac{1}{K}\sum_{i \in N_K(x,D)}\mathbb I(y_i = c)\]

We get back our first formula at the very beginning!

Lower Bound and Upper Bound of KNN Error

Assume that only two classes and using 1-KNN

Consider that the number of points, N goes to infinity, then the distance between our test point x_t and the nearest neighbour x_{nn} will tend to be zero, and therefore, x_{nn} is just like x_t.

\[n \rightarrow \infty, dist(x_{nn},x_t) \rightarrow 0, x_{nn} \rightarrow x_t\]

What is the probability that x_{nn} is not x_t?

There are only two cases where they are different and the probability is

\[\epsilon = (1-p(c|x_t))p(c|x_{nn}) + p(c|x_t)(1-p(c|x_{nn}))\]

Since x_{nn} and x_t are drawing from the same distribution, p(c|x_t) = p(c|x_{nn}). And p(c|x_{nn}) and p(c|x_t) are less than 1. We have:

\[\epsilon \leq 2(1-p(c|x_t)) \]

In other words, the error of 1-KNN is less than or equal to two times the Bayes Optimal (For more detail on Bayes Optimal Error).

What is the upper bound then? That is the most common label in the training set. In KNN, that is when K=N.

Next time, we will look at how KNN can be used in Regression problems.

Leave a Reply

%d bloggers like this: