Activation functions: Softmax vs Sigmoid

I've been trying to build an image classifier with a CNN. There are 2300 images in my dataset and two categories: men and women. Here's the model I used:

import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(min_delta = 0.001, patience = 30, restore_best_weights = True)
model = tf.keras.Sequential()

model.add(tf.keras.layers.Conv2D(256, (3, 3), input_shape=X.shape[1:],  activation = 'relu'))

model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2)))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Conv2D(256, (3, 3), input_shape=X.shape[1:], activation = 'relu'))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2)))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Flatten())  # this converts our 3D feature maps to 1D feature vectors

model.add(tf.keras.layers.Dense(64))

model.add(tf.keras.layers.Dense(1, activation='softmax'))


model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

h = model.fit(xtrain, ytrain, validation_data=(xval, yval), batch_size=32, epochs=30, callbacks=[early_stopping], verbose=0)

The accuracy of this model is 0.501897 and the loss 7.595693 (the model is stuck on these numbers in every epoch), but if I replace Softmax with Sigmoid, accuracy is about 0.98 and loss about 0.06. Why does such a strange thing happen with Softmax? All the info I could find says that these two activations are similar and that softmax is even better, but I couldn't find anything about this kind of abnormality. I'd be glad if someone could explain what the problem is.

There are 2 answers

Best answer, by d-xa:

Summary of your results:

  • a) CNN with Softmax activation function -> accuracy ~ 0.50, loss ~ 7.60
  • b) CNN with Sigmoid activation function -> accuracy ~ 0.98, loss ~ 0.06

TLDR

Update:

Now that I also see you are using only 1 output neuron with Softmax: with a single neuron you will not be able to capture the second class in binary classification. With Softmax you need to define K neurons in the output layer, where K is the number of classes you want to predict, whereas with Sigmoid a single output neuron is sufficient for binary classification.

So in short, this is what should change in your code when using softmax for 2 classes:

# use 2 neurons with softmax; note that the loss then has to change as well:
# either one-hot encode the labels and use 'categorical_crossentropy',
# or keep integer labels and use 'sparse_categorical_crossentropy'
model.add(tf.keras.layers.Dense(2, activation='softmax'))

Additionally:

When doing binary classification, a sigmoid output is more suitable, as it is simply computationally cheaper than the more general softmax function (which is normally used for multi-class prediction when you have K > 2 classes).
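
For completeness, here is a minimal sketch of the two equivalent output-layer/loss pairings in Keras. This is my own illustration, not code from the question; the parameter base stands for any Keras Sequential model built up to the last hidden layer.

import tensorflow as tf

def add_sigmoid_head(base):
    # Option A: one sigmoid neuron; integer labels 0/1; binary cross-entropy.
    base.add(tf.keras.layers.Dense(1, activation='sigmoid'))
    base.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return base

def add_softmax_head(base):
    # Option B: two softmax neurons for the same binary problem.
    # With integer labels 0/1 use sparse categorical cross-entropy;
    # with one-hot labels use 'categorical_crossentropy' instead.
    base.add(tf.keras.layers.Dense(2, activation='softmax'))
    base.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return base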


Further Reading:

Some attributes of selected activation functions

If the short answer above is not enough for you, here are, in short, some things I've learned from my research on activation functions in neural networks.

To begin with, let's be clear about the terms activation and activation function:

activation (alpha): the state of a neuron. For neurons in hidden or output layers, it is quantified by the weighted sum of the input signals from the previous layer.

activation function f(alpha): a function that transforms an activation into a neuron's output signal. It is usually a non-linear, differentiable function, for instance the sigmoid function. A lot of research and many applications have been built on the sigmoid function (see Bengio & Courville, 2016, p. 67 ff.). Mostly the same activation function is used throughout a neural network, but it is possible to use several (e.g. different ones in different layers).
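
As a tiny illustration of these two terms for a single neuron (my own sketch, not part of the original answer):

import numpy as np

x = np.array([0.5, -1.2, 3.0])     # input signals from the previous layer
w = np.array([0.1, 0.4, -0.2])     # this neuron's weights
b = 0.05                           # bias

alpha = w @ x + b                  # activation: weighted sum of the inputs
signal = 1 / (1 + np.exp(-alpha))  # activation function f(alpha), here the sigmoid
print(alpha, signal)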

Now to the effects of activation functions:

The choice of activation function can have an immense impact on the learning of a neural network (as you have seen in your example). Historically it was common to use the sigmoid function, as it is a good way to model a saturating neuron. Today, especially in CNNs, other activation functions, even ones that are only piecewise linear (like relu), are preferred over the sigmoid function. There are many different functions, just to name some: sigmoid, tanh, relu, prelu, elu, maxout, max, argmax, softmax, etc.

Now let's only compare sigmoid, relu/maxout and softmax:

# pseudo code / formula
sigmoid = f(alpha) = 1 / (1 + exp(-alpha))
relu    = f(alpha) = max(0, alpha)
maxout  = f(alpha) = max(alpha_1, alpha_2)
softmax = f(alpha_j) = exp(alpha_j) / sum_k(exp(alpha_k))
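
These formulas translate almost directly into NumPy; here is a small runnable sketch of my own (the softmax uses the usual subtract-the-max trick for numerical stability):

import numpy as np

def sigmoid(alpha):
    return 1 / (1 + np.exp(-alpha))

def relu(alpha):
    return np.maximum(0, alpha)

def softmax(alpha):
    e = np.exp(alpha - np.max(alpha))  # subtracting the max does not change the result
    return e / e.sum()

print(sigmoid(0.0))                        # 0.5
print(relu(np.array([-2.0, 3.0])))         # [0. 3.]
print(softmax(np.array([1.0, 2.0, 3.0])))  # three values that sum to 1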

sigmoid:

  • in binary classification preferably used for output layer
  • values lie in the range [0, 1], suitable for a probabilistic interpretation (+)
  • saturated neurons can kill the gradient (vanishing gradients) (-)
  • not zero centered (-)
  • exp() is computationally expensive (-)

relu:

  • no saturated neurons in positive regions (+)
  • computationally less expensive (+)
  • not zero centered (-)
  • saturated neurons in negative regions (-)

maxout:

  • positive attributes of relu (+)
  • doubles the number of parameters per neuron, which normally means more training effort (-)

softmax:

  • can be seen as a generalization of the sigmoid function (see the sketch after this list)
  • mainly used as the output activation function in multi-class prediction problems
  • values lie in the range [0, 1], suitable for a probabilistic interpretation (+)
  • computationally more expensive because of exp() terms (-)
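
To make the "generalization of the sigmoid" point above concrete, here is a small check of my own: a two-class softmax over the logits [0, z] assigns class 1 exactly the probability sigmoid(z).

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

z = 1.7
print(softmax(np.array([0.0, z]))[1])  # probability of class 1
print(sigmoid(z))                      # the same value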

Second answer, by Addy:

The reason you see those different results is the size of your output layer: it is 1 neuron.

Softmax by definition requires more than 1 output neuron to make sense. A single Softmax neuron will always output 1 (look up the formula and think about it). That is why you see ~50% accuracy: your network always predicts class 1.
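
You can check this directly (a quick verification of my own, not from the original answer): with a single output, softmax normalizes the one exponential by itself, so the result is always 1.

import tensorflow as tf

logits = tf.constant([[-4.2], [0.0], [13.7]])  # three examples, one output neuron each
print(tf.nn.softmax(logits, axis=-1))          # [[1.], [1.], [1.]] regardless of the logit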

Sigmoid doesn't have this problem: a single sigmoid neuron can output any value between 0 and 1, so it can still separate the two classes, which is why it trains.

If you want to test softmax, you have to make an output neuron for each class and then one-hot encode your ytrain and yval (look up one-hot encoding for more explanation). In your case this means: label 0 -> [1, 0], label 1 -> [0, 1]. As you can see, the index of the 1 encodes the class. I'm not sure, but in that case I believe you'd use categorical cross-entropy. I was not able to tell conclusively from the docs, but it seems to me that binary cross-entropy expects 1 output neuron whose target is either 0 or 1 (where Sigmoid is the correct activation to use), whereas categorical cross-entropy expects one output neuron per class, where Softmax makes sense. You could use Sigmoid even for the multi-output case, but it's not common.
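
As a hedged sketch of what that change could look like, reusing the variable names from the question (to_categorical is Keras' built-in one-hot helper):

import tensorflow as tf

# One-hot encode the integer labels: 0 -> [1, 0], 1 -> [0, 1]
ytrain_oh = tf.keras.utils.to_categorical(ytrain, num_classes=2)
yval_oh   = tf.keras.utils.to_categorical(yval, num_classes=2)

# One output neuron per class, paired with categorical cross-entropy
model.add(tf.keras.layers.Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(xtrain, ytrain_oh, validation_data=(xval, yval_oh), batch_size=32, epochs=30)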

So in short, it seems to me that binary cross-entropy expects the class encoded by the value of the single output neuron, whereas categorical cross-entropy expects the class encoded by which output neuron is the most active (in simplified terms).