I'm working on a comparison of popular gradient descent algorithms in Python. Here is a link to the notebook I have so far.
The Adagrad algorithm converges at a much slower rate than the plain-vanilla batch, stochastic, and mini-batch algorithms, even though I expected it to be an improvement over the basic methods. Is the difference attributable to one or more of the factors below, to something else, or is this the expected result? (A short step-size sketch follows the list of factors.)
- The test data set is small and Adagrad performs relatively better on larger data sets
- Something having to do with the characteristics of the sample data
- Something having to do with the parameters (learning rate, number of iterations, batch size)
- An error in the code
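For reference on the step-size question, here is a small sketch I put together (toy numbers of my own, not from the notebook) showing how Adagrad's effective per-parameter learning rate, alpha / (eps + sqrt(sum of squared gradients)), only ever shrinks as gradients accumulate, and shrinks fastest for the parameters with the largest gradients:

```python
import numpy as np

# Toy illustration (my own numbers, not from the notebook): track Adagrad's
# effective per-parameter step size over a few iterations with a fixed gradient.
alpha = 0.1
eps = 1e-6                       # same small constant used in gd_adagrad below
grad_hist = np.zeros(2)          # accumulated squared gradients, one entry per parameter

for t in range(1, 6):
    gradient = np.array([1.0, 0.1])   # pretend gradient: one large, one small component
    grad_hist += np.square(gradient)
    effective_lr = alpha / (eps + np.sqrt(grad_hist))
    print(t, effective_lr)       # both entries only ever decrease; the large-gradient one fastest
```

The small-gradient coordinate keeps a relatively larger step, but every coordinate's step size decays monotonically, which I understand is why Adagrad is usually run with a larger base alpha than plain SGD.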
Here is the code for Adagrad (it is also the last one in the notebook):
```python
import numpy as np

def gd_adagrad(data, alpha, num_iter, b=1):
    m, N = data.shape
    # Prepend a column of ones for the intercept; the target stays in the last column.
    Xy = np.ones((m, N + 1))
    Xy[:, 1:] = data
    theta = np.ones(N)
    grad_hist = 0                        # running sum of squared gradients (per parameter)
    for i in range(num_iter):
        np.random.shuffle(Xy)
        batches = np.split(Xy, np.arange(b, m, b))
        for B_x, B_y in ((B[:, :-1], B[:, -1]) for B in batches):
            loss_B = B_x.dot(theta) - B_y
            gradient = B_x.T.dot(loss_B) / B_x.shape[0]
            grad_hist += np.square(gradient)
            # Adagrad update: scale each parameter's step by 1 / sqrt(accumulated g^2).
            theta = theta - alpha * gradient / (10**-6 + np.sqrt(grad_hist))
    return theta

theta = gd_adagrad(data_norm, alpha*10, 150, 50)
```
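In case it helps, this is roughly how I would reproduce the comparison outside the notebook. The synthetic data, the `gd_minibatch` helper, and the specific learning rates below are stand-ins I made up (`data_norm` and `alpha` in the call above come from earlier cells), so treat it as a sketch rather than the notebook code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the notebook's data: three normalized features plus a
# noisy linear target in the last column, which is the layout gd_adagrad expects.
m, n_feat = 200, 3
X = rng.normal(size=(m, n_feat))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=m)
data_norm = np.column_stack([X, y])

def gd_minibatch(data, alpha, num_iter, b=1):
    """Plain mini-batch gradient descent with the same setup, for a side-by-side run."""
    m, N = data.shape
    Xy = np.ones((m, N + 1))
    Xy[:, 1:] = data
    theta = np.ones(N)
    for _ in range(num_iter):
        np.random.shuffle(Xy)
        for B in np.split(Xy, np.arange(b, m, b)):
            B_x, B_y = B[:, :-1], B[:, -1]
            gradient = B_x.T.dot(B_x.dot(theta) - B_y) / B_x.shape[0]
            theta = theta - alpha * gradient
    return theta

theta_sgd = gd_minibatch(data_norm, 0.1, 150, 50)
theta_ada = gd_adagrad(data_norm, 1.0, 150, 50)   # Adagrad often tolerates a larger base rate
print("mini-batch:", theta_sgd)
print("adagrad:   ", theta_ada)
```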