The GPflow docs provide an example for multi-class classification with the robust-max function. I am trying to train a multi-class classifier with the softmax likelihood instead, which is also implemented in GPflow but I can't find any documentation or examples on how to use it correctly.
Please find an example of what I have tried below. During training, the loss decreases smoothly.
The robust-max example mentioned above uses categorical labels, i.e., values 0, 1, 2, but simply replacing the robust-max by the softmax likelihood raises an IndexError in the quadrature method. Therefore, I assume this model with the softmax likelihood requires one-hot encoded labels. At test time, however, I noticed that for this three-class toy example the model never predicts class 3. Upon closer inspection, the softmax likelihood has the following method
def _log_prob(self, F, Y):
        return -tf.nn.sparse_softmax_cross_entropy_with_logits(logits=F, labels=Y[:, 0])
which looks like it expects an array of categorical labels of shape [num_samples, 1].
What is the right way to use the softmax likelihood for GP multi-class classification?
import numpy as np
import tensorflow as tf
import gpflow
from gpflow.likelihoods.multiclass import Softmax
from tqdm.auto import tqdm
np.random.seed(0)
tf.random.set_seed(123)
# Number of functions and number of data points
num_classes = 3
N = 100
# Create training data
# Jitter
jitter_eye = np.eye(N) * 1e-6
# Input
X = np.random.rand(N, 1)
# SquaredExponential kernel matrix
kernel_se = gpflow.kernels.SquaredExponential(lengthscales=0.1)
K = kernel_se(X) + jitter_eye
# Latents prior sample
f = np.random.multivariate_normal(mean=np.zeros(N), cov=K, size=(num_classes)).T
# Hard max observation
Y = np.argmax(f, 1).reshape(-1,).astype(int)
# One-hot encoding
Y_hot = np.zeros((N, num_classes), dtype=np.int)
Y_hot[np.arange(N), Y] = 1
data = (X, Y_hot)
# sum kernel: Matern32 + White
kernel = gpflow.kernels.Matern32() + gpflow.kernels.White(variance=0.01)
likelihood = Softmax(num_classes)
m = gpflow.models.VGP(
    data=data,
    kernel=kernel,
    likelihood=likelihood,
    num_latent_gps=num_classes,
)
def run_adam(model, iterations):
    """
    Utility function running the Adam optimizer
    """
    # Create an Adam Optimizer action
    losses = []
    training_loss = model.training_loss
    optimizer = tf.optimizers.Adam()
    @tf.function
    def optimization_step():
        optimizer.minimize(training_loss, model.trainable_variables)
    for step in tqdm(range(iterations), total=iterations):
        optimization_step()
        if step % 10 == 0:
            elbo = -training_loss().numpy()
            losses.append(elbo)
    return losses
run_adam(model=m, iterations=10000)
y_pred = m.predict_y(X)[0]
print("Training accuracy: {:3.2f}".format(np.mean(Y == np.argmax(y_pred, axis=1))))
 
                        
Softmax uses Monte Carlo to estimate the expected log likelihood, so you do always need Adam. The nice thing about RobustMax is that for the full dataset, you can use bfgs. My experience is that RobustMax isn't that great from a statistical perspective though...
I can never remember what form the labels should be passed in, and it's very possible that there is a difference between
RobustMaxandSoftMax. You found the correct method, and I would refer to the TensorFlow documentation oftf.nn.sparse_softmax_cross_entropy_with_logits()to determine what the right format is.