I got unexpected output while implementing the SGD algorithm for my ML homework.
My training data, a split of the dataset below, has 320 rows:
my dataset: https://github.com/Jangrae/csv/blob/master/carseats.csv
I first did some data preprocessing:
import pandas as pd
from sklearn.preprocessing import StandardScaler
import numpy as np

train_data = pd.read_csv('carseats_train.csv')

# Map the Yes/No columns (Urban, US) to 1/0
train_data.replace({'Yes': 1, 'No': 0}, inplace=True)

# One-hot encode ShelveLoc and swap the dummies in for the original column
onehot_tr = pd.get_dummies(train_data['ShelveLoc'], dtype=int, prefix_sep='_', prefix='ShelveLoc')
train_data = train_data.drop('ShelveLoc', axis=1)
train_data = train_data.join(onehot_tr)

# Sales (the first column) is the target; the remaining columns are features
train_data_Y = train_data.iloc[:, 0]
train_data_X = train_data.drop('Sales', axis=1)
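After this step the feature matrix has 12 columns (the 9 remaining predictors plus the 3 ShelveLoc dummies), which is why the weight matrix below is 12x1. A quick sanity check:

print(train_data_X.shape)           # expected: (320, 12)
print(list(train_data_X.columns))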
Then I implemented the algorithm like this:
learning_rate = 0.01
epoch_num = 50
initial_w = 0.1
intercept = 0.1
w_matrix = np.ones((12, 1)) * initial_w   # one weight per feature

for e in range(epoch_num):
    for i in range(len(train_data_X)):
        # prediction for one sample with the current parameters
        x_i = train_data_X.iloc[i].to_numpy()
        y_i = train_data_Y.iloc[i]
        y_estimated = np.dot(x_i, w_matrix) + intercept
        # per-sample gradients and parameter update
        grad_w = x_i.reshape(-1, 1) * (y_i - y_estimated)
        grad_intercept = (y_i - y_estimated)
        w_matrix = w_matrix - 2 * learning_rate * grad_w
        intercept = intercept - 2 * learning_rate * grad_intercept

print("Final weights:\n", w_matrix)
print("Final intercept:", intercept)
But the output was:
Final weights:
[[nan]
[nan]
[nan]
[nan]
[nan]
[nan]
[nan]
[nan]
[nan]
[nan]
[nan]
[nan]]
Final intercept: [nan]
I ran it with various learning rates, and I also tried a convergence threshold, but I still got the same result. I can't figure out why my code gives me NaNs.
Can anybody see the issue?
You get a numeric overflow in your code: the gradients keep growing until the values become inf and then nan. Consider taking more epochs and a much lower learning rate (a.k.a. "step size") to make your algorithm converge. I was able to get results with a learning rate of 0.000001, but you will have to see for your training set what the "correct" number is, and also monitor the convergence (depending on the number of epochs). You could also consider an adaptive learning rate schedule; a sketch of the monitoring idea is below.
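For example (a minimal sketch of the monitoring idea, not a drop-in fix; sgd_epoch is a hypothetical helper standing in for one pass of your inner per-sample loop):

learning_rate = 0.000001
for e in range(epoch_num):
    # sgd_epoch: hypothetical helper that runs one update sweep over all samples
    w_matrix, intercept = sgd_epoch(train_data_X, train_data_Y, w_matrix, intercept, learning_rate)
    preds = train_data_X.to_numpy() @ w_matrix + intercept
    mse = float(np.mean((train_data_Y.to_numpy().reshape(-1, 1) - preds) ** 2))
    print(f"epoch {e}: mse = {mse:.4f}")
    if not np.isfinite(mse):       # catch the blow-up instead of printing nans at the end
        print("diverged -- lower the learning rate")
        break
    learning_rate *= 0.99          # simple decaying schedule (the factor is arbitrary)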
On another note: I am not exactly sure that your equations are correct. Since you use (y_i - y_estimated) and not the other way around, the gradient carries a minus sign, so you need to update your weights and intercept with + instead of - (a "double minus", if you will). Maybe you can check that again.

PS: Your algorithm is not yet "stochastic": you sweep the samples in the same order every epoch instead of drawing them at random. ;D