Implementation of backpropagation algorithm


I'm building a neural network with the architecture:

input layer --> fully connected layer --> ReLU --> fully connected layer --> softmax
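For context, my forward pass looks roughly like the sketch below. The toy shapes, the W1/z1/z2/D/H/C names and the stability shift are only there to make the snippet self-contained; the other names (X, y, a1, a2, b1, b2, W2, N) match the backprop code further down, and the loss is a plain mean cross-entropy (ignoring any regularisation term):

import numpy as np

# toy shapes/values so this sketch runs on its own
N, D, H, C = 5, 4, 10, 3
rng = np.random.default_rng(0)
X = rng.standard_normal((N, D))            # inputs
y = rng.integers(0, C, size=N)             # integer class labels
W1 = 0.01 * rng.standard_normal((D, H)); b1 = np.zeros(H)
W2 = 0.01 * rng.standard_normal((H, C)); b2 = np.zeros(C)

# forward pass
z1 = X.dot(W1) + b1                        # (N, H) hidden pre-activations
a1 = np.maximum(0, z1)                     # (N, H) ReLU activations
z2 = a1.dot(W2) + b2                       # (N, C) class scores
z2 -= z2.max(axis=1, keepdims=True)        # shift for numerical stability
a2 = np.exp(z2)
a2 /= a2.sum(axis=1, keepdims=True)        # (N, C) softmax probabilities
loss = -np.log(a2[np.arange(N), y]).mean() # mean cross-entropy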

I'm using the backpropagation equations outlined here: DeepLearningBook. I think my mistake is in eq. 1. When differentiating, do I consider each example independently, yielding an N x C (number of examples x number of classes) matrix, or all examples together, yielding an N x 1 matrix?
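To be concrete, here are the four equations as I'm reading them, written in batched, row-vector form so the shapes line up with my code below (this may not be exactly the book's notation): C is the cost, z^l the pre-activations, a^l the activations, and ⊙ the elementwise product.

\begin{aligned}
\delta^{L} &= \nabla_{a} C \odot \sigma'(z^{L}) && \text{(eq. 1)}\\
\delta^{l} &= \left(\delta^{l+1} (W^{l+1})^{T}\right) \odot \sigma'(z^{l}) && \text{(eq. 2)}\\
\frac{\partial C}{\partial b^{l}} &= \sum_{n} \delta^{l}_{n} && \text{(eq. 3)}\\
\frac{\partial C}{\partial W^{l}} &= (a^{l-1})^{T} \delta^{l} && \text{(eq. 4)}
\end{aligned}

where the sum in eq. 3 runs over the N examples in the batch.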

# derivative of the correct-class softmax output w.r.t. each logit, i.e. a2_y * (1[j == y] - a2_j) per row
da2 = -a2    # a2 comprises the activation values (softmax outputs) of the output layer, shape (N, C)
da2[np.arange(N), y] += 1
da2 *= (a2[np.arange(N), y])[:, None]

# derivative of ReLU (copy first so a1 itself, which eq. 4 still needs, is not overwritten)
da1 = a1.copy()    # a1 comprises activation values of hidden layer
da1[a1 > 0] = 1

# eq. 1
mask = np.zeros(a2.shape)
mask[np.arange(N), y] = 1
delta_2 = ((1/a2) * mask) * da2 / N 
# delta_L = - (1 / a2[np.arange(N),y])[:,None] * da2 / N

# eq.2
delta_1 = np.dot(delta_2,W2.T) * da1

# eq. 3
grad_b1 = np.sum(delta_1,axis=0)
grad_b2 = np.sum(delta_2,axis=0)

# eq. 4
grad_w1 = np.dot(X.T,delta_1)
grad_w2 = np.dot(a1.T,delta_2)

Oddly, the commented line in eq. 1 returns the correct values for the biases, but I can't seem to justify that equation: it returns an N x 1 matrix, which then gets multiplied into the corresponding rows of da2.
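In case it matters, the sanity check I can run on either version is a centered-difference numeric gradient. This is just a sketch; loss_fn below is a hypothetical helper that reruns the forward pass from the snippet at the top and returns the scalar loss:

def numeric_grad(f, param, h=1e-5):
    # centered finite differences: perturb one entry of `param` at a time,
    # call f() (which should recompute the forward pass), and difference
    grad = np.zeros_like(param)
    it = np.nditer(param, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old = param[ix]
        param[ix] = old + h
        fplus = f()
        param[ix] = old - h
        fminus = f()
        param[ix] = old                  # restore the original value
        grad[ix] = (fplus - fminus) / (2 * h)
        it.iternext()
    return grad

# e.g. compare the analytic grad_b2 against
#   numeric_grad(lambda: loss_fn(X, y), b2)
# where loss_fn is the hypothetical helper that recomputes a1, a2 and returns the scalar loss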

Edit: I'm working on the assignment problems of the CS231n course, which can be found here: CS231n


1 Answer

Answered by Nimit Pattanasri:

I also couldn't find an explanation of this anywhere else, so I wrote a post :) Please read it here.