knn manual calculation vs. R class package


To better understand the kNN method, I want to manually replicate what R does in the knn function of the class package.

First, the data can be found here: https://github.com/NPejovicE/kNN. It is a CSV file containing traffic signs in 3 categories: pedestrian, speed and stop. To explain the dataset: each sign is assumed to be divided into 16 pieces, and the color at the center of each piece is measured as r/g/b color codes. So each sign has 48 columns (16 x 3). The data is split into train and test sets.

library(dplyr)  # provides %>%, filter(), slice() and select()

signs %>% filter(sample == "train") -> train_data
signs %>% filter(sample == "test") -> test_data

I want to predict the 12th row of my test data.

test_data %>% slice(12) %>% select(4:ncol(test_data)) -> my_test

I'll do kNN classification via the class package in R:

library(class)
knn(train_data[4:ncol(train_data)], unlist(my_test), cl = train_data$sign_type)

pedestrian
Levels: pedestrian speed stop

And it says it's pedestrian.

Now I'll try to manually calculate the Euclidean distance on scaled values. I'll use only the needed columns from the train and test data, and extract row 12 from the test data.
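For reference, a minimal sketch of the same kind of call on made-up data (not the signs file; `toy_train`, `toy_cl` and `toy_test` are hypothetical names). It relies on two documented defaults of `knn`: `k = 1`, and Euclidean distance on the values exactly as supplied, with no scaling done by the function. A manual comparison would therefore have to start from the unscaled columns to match it.

```r
library(class)  # recommended package shipped with standard R installs

# Made-up toy data: 4 training points with 2 features each
toy_train <- matrix(c(0, 0,
                      1, 1,
                      10, 10,
                      11, 11), ncol = 2, byrow = TRUE)
toy_cl   <- factor(c("a", "a", "b", "b"))
toy_test <- c(9, 9)  # a bare vector is treated as a single test row

# k defaults to 1; distances are Euclidean on the raw values
pred <- knn(toy_train, toy_test, cl = toy_cl)

# Manual 1-NN on the same raw values
d      <- sqrt(rowSums(sweep(toy_train, 2, toy_test)^2))
manual <- toy_cl[which.min(d)]
```

With distinct distances and `k = 1` the result is deterministic; `knn` only resorts to random tie-breaking when there are ties.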

train_data[4:ncol(train_data)] -> train_data_clean
test_data[4:ncol(test_data)] -> test_data_clean

as.data.frame(scale(train_data_clean)) -> scaled_train
as.data.frame(scale(test_data_clean)) -> scaled_test

scaled_test %>% slice(12) -> test_row

Now the step-by-step distance calculation:

scaled_train - unlist(test_row) -> diff
# Square diff:
diff^2 -> diff2
# Sum across rows:
rowSums(diff2) -> diff3
# Take a square root:
sqrt(diff3) -> diff4
# Find row with min value.
which.min(diff4) -> n

It says row 72 from my train_data has the minimum distance.
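One thing worth checking in the subtraction step is how R recycles the test vector. With a matrix, `m - v` recycles `v` down the columns in column-major order, and a data frame minus an atomic vector appears to follow the same rule, so the values may not line up with the columns they belong to. The sketch below uses a made-up 3 x 2 matrix (`m` and `v` are hypothetical stand-ins for the scaled training data and the test row) and compares plain subtraction with `sweep()`, which subtracts `v[j]` from column `j`.

```r
# Made-up 3 x 2 matrix standing in for the scaled training data
m <- matrix(c(1, 2, 3,
              10, 20, 30), ncol = 2)  # columns: (1, 2, 3) and (10, 20, 30)
v <- c(1, 10)                         # hypothetical test row, one value per column

# Plain subtraction recycles v down the columns, so v[1] and v[2]
# alternate within each column instead of matching column 1 and column 2
recycled <- m - v

# sweep() over margin 2 subtracts v[j] from every entry of column j
rowwise <- sweep(m, 2, v)

# Euclidean distance from v to each row, and the index of the nearest row
dists   <- sqrt(rowSums(rowwise^2))
nearest <- which.min(dists)
```

Here `recycled[2, 1]` is `2 - 10`, while `rowwise[2, 1]` is `2 - 1`, which is the difference a per-column distance needs.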

When I look back at my original train data, it is:

train_data %>% slice(72) %>% select(sign_type)

It is stop, not pedestrian as the class package says.

How can I replicate the result from the class package, and am I doing something wrong here?

Edit:

Even when I standardise test_data with the mean and sd from the train data, the result still differs. For each column in test_data_clean, I subtract the corresponding mean from train_data_clean and divide by the sd of the corresponding column.

#Calculate column-wise mean and sd:

unlist(sapply(train_data_clean, mean)) -> means
unlist(sapply(train_data_clean, sd)) -> sds

#Subtract mean from each observation in test_data:

as.data.frame(sapply(1:ncol(test_data_clean), function(i) test_data_clean[, i] - means[i])) -> data2

#Divide by sd:
as.data.frame(sapply(1:ncol(data2), function(i) data2[,i]/sds[i])) -> test_data_standardised


# Repeat the process for calculating Euclidean distance:

test_data_standardised %>% slice(12) -> test_row
scaled_train - unlist(test_row) -> diff
# Square diff:
diff^2 -> diff2
# Sum across rows:
rowSums(diff2) -> diff3
# Take a square root:
sqrt(diff3) -> diff4
# Find row with min value.
which.min(diff4) -> n

It is type speed (row 52).
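The train-based standardisation from the edit can also be written with `scale()`, which accepts numeric vectors for `center` and `scale`. This is only a sketch on made-up data (`toy_train` and `toy_test` are hypothetical stand-ins for train_data_clean and test_data_clean), with the distance taken column by column via `sweep()`.

```r
# Made-up data standing in for train_data_clean / test_data_clean
toy_train <- data.frame(r = c(0, 10, 20), g = c(5, 5, 8))
toy_test  <- data.frame(r = c(10, 30),    g = c(6, 9))

# Column-wise mean and sd of the training data only
means <- sapply(toy_train, mean)
sds   <- sapply(toy_train, sd)

# Standardise the test data with the training parameters
test_std <- as.data.frame(scale(toy_test, center = means, scale = sds))

# Distances from the first standardised test row to every training row
train_std <- scale(toy_train)
d <- sqrt(rowSums(sweep(train_std, 2, unlist(test_std[1, ]))^2))
nearest <- which.min(d)
```

Note that this puts both sets on the training scale, which is a modelling choice made by the caller and not something `knn` does by itself.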
