I'm taking Andrew Ng's ML class on Coursera and am a bit confused on gradient descent. The screenshot of the formula I'm confused by is here:
In his second formula, why does he multiply by the value of the ith training example? I thought when you updated you were just subtracting the step size * the cost function (which shouldn't include the ith training example).
What am I missing? It doesn't make much sense to me, especially since the ith training example is a series of values, not just one...
Thanks, bclayman
Mathematically, we are trying to minimise the error (cost) function

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

To minimise it, we take the partial derivative with respect to each parameter $\theta_j$:

$$\frac{\partial}{\partial\theta_j} J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

which gives the update formula in the screenshot. The $x_j^{(i)}$ appears because of the chain rule: differentiating $h_\theta(x^{(i)}) = \theta^T x^{(i)}$ with respect to $\theta_j$ leaves exactly $x_j^{(i)}$, so each example's error is weighted by its own feature value. That is also why the update subtracts the step size times the *derivative* of the cost, not the cost itself.
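If it helps to see that update concretely, here is a minimal NumPy sketch of one batch gradient descent step for linear regression. This is not code from the course; the names `X`, `y`, `theta`, and `alpha` are my own assumptions.

```python
import numpy as np

def gradient_descent_step(theta, X, y, alpha):
    """One batch gradient descent update for linear regression.

    X     : (m, n) matrix, one training example per row
    y     : (m,)   vector of targets
    theta : (n,)   current parameters
    alpha : learning rate (step size)
    """
    m = X.shape[0]
    predictions = X @ theta        # h_theta(x^(i)) for every example
    errors = predictions - y       # (h_theta(x^(i)) - y^(i))
    # The chain rule is what brings in x_j^(i): each example's error is
    # multiplied by its own feature values before the sum is averaged.
    gradient = (X.T @ errors) / m
    return theta - alpha * gradient
```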
The rest of the formulation (the direction and the step size) can be reasoned about as follows.
Gradient descent uses the slope of the function itself to find the minimum. Think of it as walking downhill into a valley, at each step moving in the direction of steepest descent. That gives us the direction, but what should the step size be (how far should we move in that direction before re-checking)?
For that we also use the slope, because at a minimum the slope is zero. (Think of the bottom of a valley: all nearby points are higher. On the way down the height is decreasing and the slope is negative; past the bottom the height starts increasing and the slope is positive; in between, at the minimum, the slope passes through zero.) Since the magnitude of the slope shrinks as we approach the minimum, it is a natural guide for the step: when the magnitude is large we are far away and can take big steps, and when it is small we are closing in on the minimum and should take small steps.
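As a small illustration of that last point (my own toy example, not from the course), here is gradient descent on $f(x) = x^2$, whose derivative is $f'(x) = 2x$. The learning rate is fixed, yet the printed step sizes shrink on their own as the slope shrinks near the minimum at $x = 0$.

```python
def f_prime(x):
    """Derivative of f(x) = x**2."""
    return 2 * x

x = 5.0        # start far from the minimum at x = 0
alpha = 0.1    # fixed learning rate
for i in range(10):
    step = alpha * f_prime(x)   # actual distance moved this iteration
    x -= step
    print(f"iter {i}: x = {x:.4f}, step taken = {step:.4f}")
```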