As I understand it, in a deep neural network we apply an activation function (g) after applying the weights (w) and bias (b) (z := w * X + b | a := g(z)). So there is a composition of functions (g o z), and the activation function is what lets our model learn functions other than linear ones. I can see how the Sigmoid and Tanh activation functions make the model non-linear, but I have trouble seeing how ReLU (which takes the max of 0 and z) can make a model non-linear...
Say z is always positive; then it would be as if there were no activation function at all...
So why does ReLU make a neural network model non-linear?
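To make my concern concrete, here is a rough numpy sketch (the weights and inputs are just made-up numbers, not from any real model): without an activation, two stacked layers collapse into a single linear layer, and if z stays positive, ReLU does not seem to change anything either.

```python
import numpy as np

# Made-up weights and biases for two "layers" (illustrative only).
W1, b1 = np.array([[1.0, -2.0], [0.5, 3.0]]), np.array([0.1, -0.2])
W2, b2 = np.array([[2.0, 0.0], [-1.0, 1.0]]), np.array([0.3, 0.0])

X = np.array([1.0, 2.0])

# Two linear layers with no activation...
h = W1 @ X + b1
out_no_activation = W2 @ h + b2

# ...are equivalent to a single linear layer W @ X + b:
W = W2 @ W1
b = W2 @ b1 + b2
print(np.allclose(out_no_activation, W @ X + b))  # True

# My worry: if z is always positive, relu(z) = z, so ReLU seems to change nothing.
relu = lambda z: np.maximum(0.0, z)
z_positive = np.array([0.7, 2.5])
print(np.allclose(relu(z_positive), z_positive))  # True
```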
Deciding if a function is linear or not is of course not a matter of opinion or debate; there is a very simple definition of a linear function, which is roughly:
f(a*x + b*y) = a*f(x) + b*f(y)

for every x & y in the function domain and a & b constants.

The requirement "for every" means that, if we are able to find even a single example where the above condition does not hold, then the function is nonlinear.
Assuming for simplicity that a = b = 1, let's try x = -5, y = 1 with f being the ReLU function:

f(-5 + 1) = f(-4) = 0
f(-5) + f(1) = 0 + 1 = 1

so, for these x & y (in fact for every x & y with x*y < 0) the condition f(x + y) = f(x) + f(y) does not hold, hence the function is nonlinear...
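If you want to see it numerically, here is a quick numpy check of exactly that counterexample (the relu helper is just my own definition of max(0, z)):

```python
import numpy as np

def relu(z):
    """ReLU: element-wise max(0, z)."""
    return np.maximum(0.0, z)

x, y = -5.0, 1.0

lhs = relu(x + y)          # relu(-4) = 0
rhs = relu(x) + relu(y)    # 0 + 1 = 1
print(lhs, rhs, np.isclose(lhs, rhs))  # 0.0 1.0 False -> linearity violated
```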
The fact that we may be able to find subdomains (e.g. both x and y being either negative or positive here) where the linearity condition holds is what defines some functions (such as ReLU) as piecewise-linear, which are still nonlinear nevertheless.

Now, to be fair to your question, if in a particular application the inputs happened to be always either all positive or all negative, then yes, in this case ReLU would in practice end up behaving like a linear function. But for neural networks this is not the case, hence we can indeed rely on it to provide our necessary non-linearity...
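A small sketch of the two regimes, if it helps (the sample values are arbitrary): on an all-positive subdomain ReLU is just the identity, so the linearity condition holds there, while for mixed-sign inputs, which the pre-activations in a real network routinely produce, it breaks.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)

# All-positive subdomain: ReLU acts as the identity, so it behaves linearly here.
x_pos = rng.uniform(0.1, 5.0, size=3)
y_pos = rng.uniform(0.1, 5.0, size=3)
print(np.allclose(relu(x_pos + y_pos), relu(x_pos) + relu(y_pos)))  # True

# Mixed-sign inputs (the typical case inside a network): linearity breaks.
x_mix = np.array([-5.0, 2.0, -0.5])
y_mix = np.array([1.0, -3.0, 4.0])
print(np.allclose(relu(x_mix + y_mix), relu(x_mix) + relu(y_mix)))  # False
```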