I've been trying to understand the backpropagation algorithm well enough to program a neural network and its learning procedure from scratch, just like programming anything else. What I mainly struggle with are the shapes of the matrices in the calculations; the moment the matrices start getting transposed here and there, my mind just loses it. Some of the calculations themselves confuse me too.
I've been teaching myself the relevant matrix mathematics for a while now just for this, so expect some understanding of it, but also a bit of confusion.
This is what I found on the following link:
https://pyimagesearch.com/2021/05/06/backpropagation-from-scratch-with-python/
```python
def fit_partial(self, x, y):
    # Feedforward: store the activations of every layer, starting with the input
    A = [np.atleast_2d(x)]
    for layer in np.arange(0, len(self.W)):
        net = A[-1].dot(self.W[layer])
        out = self.sigmoid(net)
        A.append(out)

    # Backpropagation: build the list of deltas, starting from the output error
    error = A[-1] - y
    D = [error * self.sigmoid_derivative(A[-1])]
    for layer in np.arange(len(A) - 2, 0, -1):
        delta = np.matmul(D[-1], self.W[layer].T) * self.sigmoid_derivative(A[layer])
        D.append(delta)
    D = D[::-1]

    # Weight update
    for layer in np.arange(0, len(self.W)):
        self.W[layer] -= self.alpha * np.matmul(A[layer].T, D[layer])
```
(Note that I modified the code a little so it makes more sense to me; the result is exactly the same, as you can confirm by looking at the link or if you understand backpropagation.)
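For reference, this is roughly how I've been setting up the rest of the class while experimenting with that snippet. It's my own simplified reconstruction, not the exact code from the link: the constructor, the weight shapes (I'm ignoring bias terms here to keep the shapes easy to follow), and the two sigmoid helpers are my assumptions.

```python
import numpy as np

class SimpleNN:
    # My own minimal reconstruction so fit_partial has something to run
    # against; not the exact class from the linked post.
    def __init__(self, layers, alpha=0.1):
        self.alpha = alpha
        # One weight matrix per pair of consecutive layers
        # (biases left out on purpose -- an assumption on my part)
        self.W = [np.random.randn(layers[i], layers[i + 1]) * 0.1
                  for i in range(len(layers) - 1)]

    def sigmoid(self, x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_derivative(self, a):
        # Takes the already-activated value a = sigmoid(net), as in the post
        return a * (1.0 - a)
```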
The formula to update the weights is the following:
W_ij' = W_ij - a * (∂C / ∂W_ij)
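Just to make sure I'm reading the formula right, here's a tiny numeric example of a single update step for one weight (all the numbers are made up):

```python
# One gradient-descent step for a single weight, with made-up numbers
W_ij = 0.5      # current weight
alpha = 0.1     # learning rate
grad = 0.2      # pretend this is dC/dW_ij
W_ij_new = W_ij - alpha * grad   # 0.5 - 0.1 * 0.2 = 0.48
```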
The problem is, I can't find the intuition behind the relationship between that formula and the code. I'll explain myself now.
Going line by line:
(Correct me if I'm wrong when I assume I understand a line when I don't.)
A = [np.atleast_2d(x)]
In this line, we initialize A, the list that will store the activations of each layer, making sure the input x is at least 2-dimensional (a row vector) so that it is compatible with the matrix multiplications that come later.
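If my understanding is right, this is what np.atleast_2d does to a single 1-D sample (a quick check I ran):

```python
import numpy as np

x = np.array([0.2, 0.7, 0.1])   # a single 1-D sample, shape (3,)
x2 = np.atleast_2d(x)           # becomes a row vector, shape (1, 3)
print(x.shape, x2.shape)        # (3,) (1, 3)
```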
for layer in np.arange(0, len(self.W)):
Here we loop over the layers (one weight matrix per layer) to start the feedforward pass.
net = A[-1].dot(self.W[layer])
Here we multiply the previous layer's activations by the current weight matrix. The feedforward pass begins.
out = self.sigmoid(net)
Here we activate the current layer.
Important: something I want to make sure of is, is the result of A[-1].dot(self.W[layer]) always a matrix whose axis 0 equals 1? In other words, does the resulting matrix of that calculation always have exactly one row? (I'm supposing this because, since it is supposed to represent a layer [before being activated], it should be a vector rather than a matrix with more than one row.)
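This is the quick shape check I did to test that assumption (for a single sample the leading axis seems to stay 1; with a batch of k samples I suppose it would be k instead):

```python
import numpy as np

A_last = np.random.rand(1, 4)   # activations of the previous layer, shape (1, 4)
W = np.random.randn(4, 3)       # weights connecting 4 units to 3 units
net = A_last.dot(W)             # (1, 4) @ (4, 3) -> (1, 3): axis 0 stays 1
print(net.shape)                # (1, 3)
```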
A.append(out)
Here we simply construct the list that stores the activations.
error = A[-1] - y
Here the heavy confusion starts. I do understand the backpropagation algorithm conceptually, but I don't know how to implement it, which basically means I don't fully understand it.
I understand that it all starts from taking the difference between the actual output and the expected output. This makes sense, because that way you have a measure of how "wrong" the network is: to get closer to being "right", you want to shrink that difference, which you obtain by subtracting the desired output from the actual prediction. Backpropagation is all about having control over that error, and this is the intuition behind using gradient descent in this algorithm.
And if you want the rate of change of a function, you use a derivative. He uses the squared error as the error measure, so my question is: why is the code computing error = A[-1] - y when the derivative of (p - y) ** 2 is 2 * (p - y), not simply p - y?
In other words, why is that line not error = 2 * (A[-1] - y) instead of error = A[-1] - y?
Is it just a matter of efficiency, since the algorithm would perform the same (or better) if we leave out that multiplication by 2?
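This is how I picture my own guess from the line above (that dropping the 2 only rescales the step, which the learning rate could absorb). I'm not sure it's the author's actual reason, and the numbers are made up:

```python
import numpy as np

p = np.array([[0.8]])   # made-up prediction
y = np.array([[1.0]])   # made-up target
w, alpha = 0.5, 0.1     # made-up weight and learning rate

# Full derivative of (p - y) ** 2, paired with a halved learning rate...
step_a = w - (alpha / 2) * 2 * (p - y)
# ...versus dropping the 2, as the code does, with the original learning rate
step_b = w - alpha * (p - y)
print(np.allclose(step_a, step_b))   # True
```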
D = [error * self.sigmoid_derivative(A[-1])]
In this line, self.sigmoid_derivative(A[-1]) is multiplied element-wise with (broadcast against) error.
The number of columns of self.sigmoid_derivative(A[-1]) has to be the same as in error, right?
We then simply start building the delta chain for the chain rule.
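This is how I checked the shape question just above. Both arrays seem to have identical shapes here, so it looks to me like a plain element-wise product, if I'm reading it right (the sizes are made up):

```python
import numpy as np

A_last = np.random.rand(1, 3)         # output activations, shape (1, 3)
y = np.array([[0.0, 1.0, 0.0]])       # target of the same shape
error = A_last - y                    # shape (1, 3)
d = error * (A_last * (1 - A_last))   # element-wise product, still (1, 3)
print(error.shape, d.shape)           # (1, 3) (1, 3)
```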
for layer in np.arange(len(A) - 2, 0, -1):
Here, we start looping over the layers backwards to build the chain of derivatives. He stated the following with respect to not taking the last two layers into account:
# the last two layers are a special case where the input
# connections need a bias term but the output does not
What does this mean? That when we are at the last (output) layer, which is not a hidden layer while the previous one is, we still need to set up the dimensions of the matrix so that they are compatible with the previous layer having a bias weight while the current one does not?
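My current reading of the linked post (which might be wrong) is that the hidden-layer weight matrices carry an extra row and column for the bias, while the final matrix only carries the extra input row. Something like this sketch, where the architecture is just an example I picked:

```python
import numpy as np

layers = [2, 2, 1]   # example architecture: 2 inputs, 2 hidden units, 1 output

W = []
# Hidden layers: I think both the inputs and the outputs get a bias entry here
for i in range(len(layers) - 2):
    W.append(np.random.randn(layers[i] + 1, layers[i + 1] + 1))
# Last two layers: the input side still needs the bias term, the output does not
W.append(np.random.randn(layers[-2] + 1, layers[-1]))

print([w.shape for w in W])   # [(3, 3), (3, 1)] for this architecture
```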
The first most important part of the question
delta = np.matmul(D[-1], self.W[layer].T) * self.sigmoid_derivative(A[layer])
This single line is the part of the algorithm that confuses me the most.
According to the chain rule, this line should look something like the following:
∂C/∂a1 * ∂a1/∂a0 * ∂a0/∂w0 * ...
but it doesn't. It multiplies directly by the weight matrix, not by another derivative.
To be more specific, by the transpose of the weight matrix, no less. I simply don't know what is happening here. To my understanding, it should actually look like: delta = D[-1] * self.sigmoid_derivative(A[layer])
That's how it would make sense to me. I don't get what the weight matrix is doing there as a factor, and especially why it is being transposed.
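Here's the shape bookkeeping I attempted while staring at that line, in case it helps pin down exactly what I'm missing. The layer sizes are made up and I'm ignoring biases:

```python
import numpy as np

# Shapes for a made-up network 4 -> 3 -> 2, single sample, biases ignored
A1 = np.random.rand(1, 3)       # activations of the hidden layer
W1 = np.random.randn(3, 2)      # weights from hidden (3) to output (2)
D_last = np.random.rand(1, 2)   # delta of the output layer, shape (1, 2)

delta = np.matmul(D_last, W1.T) * (A1 * (1 - A1))   # (1, 2) @ (2, 3) -> (1, 3)
print(delta.shape)              # (1, 3) -- matches the hidden layer, shape-wise
```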
D.append(delta)
Here we append the new delta, continuing the chain for the chain rule.
for layer in np.arange(0, len(self.W)):
Here we loop over the weight matrices of the network.
The second most important part of the question
self.W[layer] -= self.alpha * np.matmul(A[layer].T, D[layer])
I get what is happening here; it's following the formula:
W_ij' = W_ij - a * (∂C / ∂W_ij)
Which I love, by the way, it's so beautiful... until I see a transpose again, this time on the activations. I don't get that part.
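At least shape-wise I can see that the transpose has to be there (this is the check I ran, with made-up layer sizes), but the intuition behind it is what I'm after:

```python
import numpy as np

A_layer = np.random.rand(1, 3)    # activations feeding into this weight matrix
D_layer = np.random.rand(1, 2)    # deltas of the layer those weights produce
W_layer = np.random.randn(3, 2)   # the weights being updated

update = np.matmul(A_layer.T, D_layer)   # (3, 1) @ (1, 2) -> (3, 2)
print(update.shape == W_layer.shape)     # True -- at least the shapes line up
```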