As a basic proof of concept, consider a network that classifies $K$ classes, with input $x$, bias $b$, output $y$, $S$ samples, weights $v$, and teacher signal $t$, where $t_k = 1$ if the sample belongs to class $k$ (and $0$ otherwise).
Let $x_{is}$ denote the $i$-th input feature of the $s$-th sample, $v_k$ the vector of weights connecting all inputs to the $k$-th output (with $v_{ki}$ the weight from input $i$), and $t_s$ the teacher signal for the $s$-th sample.
If we extend the above variables to multiple samples, the changes below have to be applied, declaring the activation $z_k$, the activation function $f(\cdot)$, and using cross-entropy as the cost function:

[Derivation]
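In case the image is unavailable, here is a sketch of the setup it presumably shows, assuming $f$ is the softmax (which the cross-entropy/delta-rule result implies) and using the definitions above; the exact notation in the original image may differ:

$$z_{ks} = \sum_i v_{ki}\, x_{is} + b_k, \qquad y_{ks} = f(z_{ks}) = \frac{e^{z_{ks}}}{\sum_{c=1}^{K} e^{z_{cs}}}, \qquad E = -\sum_{s=1}^{S}\sum_{k=1}^{K} t_{ks} \ln y_{ks}$$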
Typically in a learning rule the delta term $(t_k - y_k)$ is always included, so why doesn't it show up in this equation? Have I missed something, or is the delta rule not guaranteed to appear?
I managed to find the solution. It becomes clear once we consider the Kronecker delta, where $\delta_{ck} = 1$ if $c = k$ (i.e. the class matches the classifier output) and $\delta_{ck} = 0$ otherwise, which means the derivation takes this shape:
[Derivation]
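In case the image doesn't load, a sketch of the steps it likely contains, assuming the softmax/cross-entropy setup above and one-hot targets ($\sum_c t_{cs} = 1$): the softmax derivative is

$$\frac{\partial y_{cs}}{\partial z_{ks}} = y_{cs}\,(\delta_{ck} - y_{ks}),$$

so by the chain rule

$$\frac{\partial E}{\partial z_{ks}} = -\sum_{c} \frac{t_{cs}}{y_{cs}}\, y_{cs}\,(\delta_{ck} - y_{ks}) = y_{ks}\sum_{c} t_{cs} - t_{ks} = y_{ks} - t_{ks},$$

and hence

$$\frac{\partial E}{\partial v_{ki}} = \sum_{s} (y_{ks} - t_{ks})\, x_{is},$$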
which leads to the delta rule.
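As a quick sanity check, here is a minimal NumPy sketch (my own, not from the derivation image) that verifies $\partial E / \partial z_k = y_k - t_k$ numerically for a single sample:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax: subtract the max before exponentiating.
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, t):
    # E = -sum_k t_k * ln(y_k), with y = softmax(z).
    return -np.sum(t * np.log(softmax(z)))

K = 4
rng = np.random.default_rng(0)
z = rng.normal(size=K)           # pre-activations z_k
t = np.zeros(K); t[2] = 1.0      # one-hot teacher signal

# Analytic gradient from the derivation: dE/dz_k = y_k - t_k.
analytic = softmax(z) - t

# Numerical gradient by central differences.
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(K)[k], t)
     - cross_entropy(z - eps * np.eye(K)[k], t)) / (2 * eps)
    for k in range(K)
])

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```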