Regression Model with 3 Hidden DenseVariational Layers in Tensorflow-Probability returns nan as loss during training


I am getting acquainted with TensorFlow Probability and here I am running into a problem: during training, the model returns nan as the loss (most likely the loss grows large enough to overflow). Since the functional form of the synthetic data is not overly complicated and the ratio of data points to parameters does not look alarming at first glance, I wonder what the problem is and how it could be corrected.

The code is the following, accompanied by some possibly helpful images:

# Imports assumed by the snippets below (not shown in the original post)
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow.keras import Sequential, Model
from tensorflow.keras.layers import Input

tfd = tfp.distributions
tfpl = tfp.layers

# Create and plot 5000 data points

x_train = np.linspace(-1, 2, 5000)[:, np.newaxis]
y_train = np.power(x_train, 3) + 0.1*(2+x_train)*np.random.randn(5000)[:, np.newaxis]

plt.scatter(x_train, y_train, alpha=0.1)
plt.show()

[Image: scatter plot of the 5,000 training points]

# Define the prior weight distribution -- all N(0, 1) -- and not trainable

def prior(kernel_size, bias_size, dtype=None):

    n = kernel_size + bias_size

    prior_model = Sequential([
        tfpl.DistributionLambda(
            lambda t: tfd.MultivariateNormalDiag(loc=tf.zeros(n),
                                                 scale_diag=tf.ones(n))
        )
    ])

    return prior_model
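
# Optional sanity check (not part of the original script): the prior built this
# way has no trainable variables and always returns the same standard normal
# over all n weights.
prior_check = prior(kernel_size=16, bias_size=16)
print(prior_check(tf.zeros([1])).event_shape)   # (32,) -- one dimension per weight
print(prior_check.trainable_variables)          # []    -- nothing to train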

# Define variational posterior weight distribution -- multivariate Gaussian

def posterior(kernel_size, bias_size, dtype=None):

    n = kernel_size + bias_size

    posterior_model = Sequential([
        # The parameters of the posterior are declared as trainable Variables.
        # The VariableLayer holds exactly as many values as are needed to build a
        # MultivariateNormalTriL distribution over n dimensions (event_size = n);
        # that number is given by tfpl.MultivariateNormalTriL.params_size(n).
        tfpl.VariableLayer(tfpl.MultivariateNormalTriL.params_size(n), dtype=dtype),

        # The output of the VariableLayer becomes the input of the
        # MultivariateNormalTriL layer, so the posterior handed back to the
        # variational layer is a full-covariance Gaussian with one dimension per
        # weight of the DenseVariational layer, with locations and scales learned
        # from the data.
        tfpl.MultivariateNormalTriL(n)
    ])

    return posterior_model
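
# Optional sanity check (not part of the original script): for the first hidden
# layer the kernel has 1*16 weights and the bias 16 more, so n = 32 and the TriL
# posterior needs 32 + 32*33/2 = 560 trainable parameters.
posterior_check = posterior(kernel_size=16, bias_size=16, dtype=tf.float32)
q = posterior_check(tf.zeros([1]))       # the VariableLayer ignores its input
print(posterior_check.count_params())    # 560
print(q.event_shape)                     # (32,)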

x_in = Input(shape = (1,))

x = tfpl.DenseVariational(units= 2**4,
                          make_prior_fn=prior,
                          make_posterior_fn=posterior,
                          kl_weight=1/x_train.shape[0],
                          activation='relu')(x_in)

x = tfpl.DenseVariational(units= 2**4,
                          make_prior_fn=prior,
                          make_posterior_fn=posterior,
                          kl_weight=1/x_train.shape[0],
                          activation='relu')(x)

x = tfpl.DenseVariational(units=tfpl.IndependentNormal.params_size(1),
                          make_prior_fn=prior,
                          make_posterior_fn=posterior,
                          kl_weight=1/x_train.shape[0])(x)

y_out = tfpl.IndependentNormal(1)(x)

model = Model(inputs = x_in, outputs = y_out)

def nll(y_true, y_pred):
    return -y_pred.log_prob(y_true)

model.compile(loss=nll, optimizer= 'Adam')
model.summary()

[Image: model.summary() output showing 38,589 trainable parameters]
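
# Optional sanity check (not part of the original script): the model output
# behaves like a distribution, so nll can call log_prob on it.
y_dist = model(x_train[:5].astype('float32'))
print(y_dist.mean().shape)                               # (5, 1)
print(y_dist.stddev().shape)                             # (5, 1)
print(nll(y_train[:5].astype('float32'), y_dist).shape)  # (5,)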

# Train the model

history = model.fit(x_train, y_train, epochs=500)

[Image: training log in which the loss becomes nan]

There are 2 answers

Michael Glazunov:

The problem seems to be in the loss function: the negative log-likelihood of an independent normal distribution whose location and scale are entirely unconstrained lets the variance grow without bound, which blows up the final loss value. Since you are experimenting with variational layers, you are presumably interested in estimating the epistemic uncertainty; to that end, I would recommend using a constant variance.

I tried to make a couple of slight changes to your code, along the following lines:

  1. first of all, the final output y_out comes directly from the final variational layer, without any IndependentNormal distribution layer:

    y_out = tfpl.DenseVariational(units=1,
                                  make_prior_fn=prior,
                                  make_posterior_fn=posterior,
                                  kl_weight=1/x_train.shape[0])(x)
    
  2. second, the loss function still computes the negative log-likelihood under a normal distribution, but now with a static variance, so the loss cannot blow up during training:

     def nll(y_true, y_pred):
         dist = tfp.distributions.Normal(loc=y_pred, scale=1.0)
         return tf.reduce_sum(-dist.log_prob(y_true))
    
  3. then the model is compiled and trained in the same way as before:

     model.compile(loss=nll, optimizer= 'Adam')
     history = model.fit(x_train, y_train, epochs=3000)
    
  4. and finally let's sample 100 different predictions from the trained model and plot these values to visualize the epistemic uncertainty of the model:

     predicted = [model(x_train) for _ in range(100)]
     for res in predicted:
         plt.plot(x_train, res, alpha=0.1)
     plt.scatter(x_train, y_train, alpha=0.1)
     plt.show()
    

After 3000 epochs the result looks like this (with the number of training points reduced from 5000 to 3000 to speed up training):

[Image: the 100 sampled predictions plotted over the training data]
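
For completeness: an alternative to fixing the variance at 1.0 is to keep an IndependentNormal-style output head but constrain the learned scale with a softplus and a small floor, so it can neither collapse to zero nor explode. The sketch below is only a suggestion along those lines, reusing the prior and posterior from the question and replacing the last two layers of the original model:

    x = tfpl.DenseVariational(units=2,   # one loc and one raw-scale output
                              make_prior_fn=prior,
                              make_posterior_fn=posterior,
                              kl_weight=1/x_train.shape[0])(x)

    y_out = tfpl.DistributionLambda(
        lambda t: tfd.Independent(
            tfd.Normal(loc=t[..., :1],
                       scale=1e-3 + tf.math.softplus(0.05 * t[..., 1:])),
            reinterpreted_batch_ndims=1))(x)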

Peter Pirog:

The model has 38,589 trainable parameters, but you have only 5,000 data points, so effective training is impossible with that many parameters relative to the data.
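
For reference, a rough breakdown of where that count comes from, computed with the same params_size helper used in the question (dv_params is just a throwaway name for this sketch):

    import tensorflow_probability as tfp
    tfpl = tfp.layers

    def dv_params(in_dim, units):
        # Each DenseVariational layer learns a full-covariance (TriL) posterior over
        # its n = in_dim*units + units weights, which takes n locations plus
        # n*(n+1)/2 scale entries.
        n = in_dim * units + units
        return tfpl.MultivariateNormalTriL.params_size(n)

    total = (dv_params(1, 16)      # n = 32  ->    560 parameters
             + dv_params(16, 16)   # n = 272 -> 37,400 parameters
             + dv_params(16, 2))   # n = 34  ->    629 parameters
    print(total)                   # 38,589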