So I have found a formula describing stochastic gradient descent (SGD):
θ = θ - η*∇L(θ; x, y)
where θ is a parameter, η is the learning rate, and ∇L() is the gradient of the loss function. But what I don't get is how the parameter θ (which should be the weights and biases) can be updated mathematically. Is there a mathematical interpretation of the parameter θ?
Thanks for any answers.
That formula applies to both gradient descent and stochastic gradient descent (SGD). The difference between the two is that in SGD the loss is computed over a random subset of the training data (i.e. a mini-batch), as opposed to being computed over all the training data as in traditional gradient descent. So in SGD,
x and y correspond to a subset of the training data and labels, whereas in gradient descent they correspond to all the training data and labels.

θ represents the parameters of the model. Mathematically this is usually modeled as a vector containing all the parameters of the model (all the weights, biases, etc.) arranged into a single vector. When you compute the gradient of the loss (a scalar) w.r.t. θ, you get a vector containing the partial derivative of the loss w.r.t. each element of θ. So ∇L(θ; x, y) is just a vector, the same size as θ. If we were to assume that the loss were a linear function of θ, then this gradient points in the direction in parameter space that would result in the maximal increase in loss, with a magnitude that corresponds to the expected increase in loss if we took a step of size 1 in that direction. Since the loss isn't actually a linear function, and we actually want to decrease the loss, we instead take a smaller step in the opposite direction, hence the η and the minus sign.
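To make this concrete, here is a minimal NumPy sketch of one such update for a small linear-regression model with a mean-squared-error loss. The names (theta, eta, grad_loss) and the choice of model, mini-batch size, and learning rate are just illustrative assumptions, not part of the formula itself; the point is that θ is one flat vector and the gradient has the same shape.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 examples with 3 features, labels from a fixed linear rule.
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3

# theta holds *all* parameters in one vector: 3 weights followed by 1 bias.
theta = np.zeros(4)
eta = 0.1  # learning rate

def grad_loss(theta, x_batch, y_batch):
    """Gradient of the MSE loss w.r.t. theta; same shape as theta."""
    w, b = theta[:-1], theta[-1]
    residual = x_batch @ w + b - y_batch              # predictions minus labels
    grad_w = 2 * x_batch.T @ residual / len(y_batch)  # ∂L/∂w
    grad_b = 2 * residual.mean()                      # ∂L/∂b
    return np.concatenate([grad_w, [grad_b]])

# SGD: the gradient is computed on a random mini-batch, not the full data set.
batch_idx = rng.choice(len(y), size=16, replace=False)
theta = theta - eta * grad_loss(theta, X[batch_idx], y[batch_idx])
```

Using the full X and y in grad_loss instead of the mini-batch would give you ordinary (batch) gradient descent with the exact same update rule.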
It's also worth pointing out that, mathematically, the form you've given is a bit problematic. We wouldn't usually write it like this, since assignment and equality aren't the same thing. The equation you provided would seem to imply that the θ on the left-hand side and the θ on the right-hand side are the same. They are not. The θ on the left side of the equals sign represents the value of the parameters after taking a step, and the θs on the right side correspond to the parameters before taking a step. We could be clearer by writing it with subscripts,

θ_{t+1} = θ_{t} - η*∇L(θ_{t}; x, y)

where θ_{t} is the parameter vector at step t and θ_{t+1} is the parameter vector one step later.
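A sketch of the update written as an explicit loop (reusing the assumed theta, eta, X, y, rng, and grad_loss from the snippet above) makes the distinction visible: the gradient is always evaluated at the old θ_{t}, and only afterwards is the result assigned as θ_{t+1}.

```python
# Iterative form of the update: theta_next (θ_{t+1}) is computed from the
# current theta (θ_t); the gradient is evaluated at θ_t before the assignment.
num_steps = 1000
for t in range(num_steps):
    batch_idx = rng.choice(len(y), size=16, replace=False)
    theta_next = theta - eta * grad_loss(theta, X[batch_idx], y[batch_idx])
    theta = theta_next  # θ_{t+1} becomes the θ_t of the next iteration
```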