I'm currently interested in using Cross Entropy Error when performing the BackPropagation algorithm for classification, where I use the Softmax Activation Function in my output layer.
From what I gather, with Cross Entropy and Softmax you can drop the derivative term, so the output-layer error looks like this:
Error = targetOutput[i] - layerOutput[i]
This differs from the Mean Squared Error case, where:
Error = Derivative(layerOutput[i]) * (targetOutput[i] - layerOutput[i])
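To make the two rules concrete, here is a minimal NumPy sketch of the output-layer error term computed both ways. The array names and the 3-class example values are just illustrative assumptions, not taken from any particular library:

import numpy as np

def softmax(z):
    # numerically stable softmax
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

# hypothetical pre-activations and one-hot target for a 3-class example
z = np.array([1.0, 2.0, 0.5])
target = np.array([0.0, 1.0, 0.0])

# Softmax output + Cross Entropy: the activation derivative cancels,
# so the error is just the plain difference
layer_output = softmax(z)
delta_ce = target - layer_output                      # targetOutput[i] - layerOutput[i]

# Tanh output + Mean Squared Error: the activation derivative stays
layer_output = np.tanh(z)
tanh_deriv = 1.0 - layer_output ** 2                  # derivative of tanh at this output
delta_mse = tanh_deriv * (target - layer_output)      # Derivative(...) * (target - output)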
So, can you only drop the derivative term when your output layer uses the Softmax Activation Function for classification with Cross Entropy? For instance, if I were to do regression using the Cross Entropy Error (with, say, a TANH activation function), I would still need to keep the derivative term, correct?
I haven't been able to find an explicit answer to this, and I haven't attempted to work out the math either (as I am rusty).
You do not use the derivative term in the output layer because there you get the 'real' error (the difference between your output and your target); in the hidden layers you have to calculate an approximate error using backpropagation.
What we are doing there is an approximation: we take the derivative of the next layer's error with respect to the current layer's weights, instead of using the current layer's own error (which is unknown).
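If it helps, here is a sketch of the usual derivation (in standard notation, not tied to your code) showing why the derivative term disappears specifically in the softmax + cross-entropy case:

For a general output activation $y_i = f(z_i)$ and loss $E$, the output-layer delta is
$$\delta_i = \frac{\partial E}{\partial z_i} = \sum_j \frac{\partial E}{\partial y_j}\,\frac{\partial y_j}{\partial z_i},$$
which in general keeps the activation derivative. For softmax outputs $y_i = e^{z_i}/\sum_k e^{z_k}$ combined with cross-entropy $E = -\sum_j t_j \log y_j$, those factors cancel and the sum collapses to
$$\frac{\partial E}{\partial z_i} = y_i - t_i,$$
i.e. exactly the plain difference (up to the sign convention used in your pseudocode). With a tanh output or a squared-error loss this cancellation does not occur, so the $f'(z_i)$ factor has to stay.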
Best regards,