I've been trying to build an image classifier with CNN. There are 2300 images in my dataset and two categories: men and women. Here's the model I used:
early_stopping = EarlyStopping(min_delta = 0.001, patience = 30, restore_best_weights = True)
model = tf.keras.Sequential()
model.add(tf.keras.layers.Conv2D(256, (3, 3), input_shape=X.shape[1:],  activation = 'relu'))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2)))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Conv2D(256, (3, 3), input_shape=X.shape[1:], activation = 'relu'))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2)))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Flatten())  # this converts our 3D feature maps to 1D feature vectors
model.add(tf.keras.layers.Dense(64))
model.add(tf.keras.layers.Dense(1, activation='softmax'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
h= model.fit(xtrain, ytrain, validation_data=(xval, yval), batch_size=32, epochs=30, callbacks = [early_stopping], verbose = 0)
Accuracy of this model is 0.501897 and loss 7.595693(the model is stuck on these numbers in every epoch) but if I replace Softmax activation with Sigmoid, accuracy is about 0.98 and loss 0.06. Why does such strange thing happen with Softmax? All info I could find was that these two activations are similar and softmax is even better but I couldn't find anything about such abnormality. I'll be glad if someone could explain what the problem is.
 
                        
Summary of your results:
TLDR
Update:
Now that I also see you are using only 1 output neuron with Softmax, you will not be able to capture the second class in binary classification. With Softmax you need to define K neurons in the output layer - where K is the number of classes you want to predict. Whereas with Sigmoid: 1 output neuron is sufficient for binary classification.
so in short, this should change in your code when using softmax for 2 classes:
Additionally:
When doing binary classification, a sigmoid function is more suitable as it is simply computationally more effective compared to the more generalized softmax function (which is normally being used for multi-class prediction when you have K>2 classes).
Further Reading:
Some attributes of selected activation functions
If the short answer above is not enough for you, I can share with you some things I've learned from my research about activation functions with NNs in short:
To begin with, let's be clear with the terms activation and activation function
Now to the effects of activation functions:
Now let's only compare sigmoid, relu/maxout and softmax:
sigmoid:
relu:
maxout:
softmax:
Some good references for further reading: