I'm implementing language model training on the Penn Treebank.
I'm summing the loss over every timestep and then calculating perplexity from that sum.
This gives me a nonsensically high perplexity in the hundreds of billions, even after training for a while.
The loss itself decreases, but only to about 20 at best (I'd need a single-digit loss to get a sensible perplexity).
This makes me wonder whether my perplexity calculation is misguided.
Should it be based on the average loss per timestep instead of the sum of all of them?
My batch_size is 20, num_steps is 35.
import numpy as np

def perplexity(loss):
    # exponentiate the (natural-log) cross entropy to get perplexity
    return np.exp(loss)
...
# nn, F, and PF are the usual nnabla, nnabla.functions, and
# nnabla.parametric_functions aliases (imported in the code elided above)
loss = 0
x = nn.Variable((batch_size, num_steps))
t = nn.Variable((batch_size, num_steps))
e_list = [PF.embed(x_elm, num_words, state_size, name="embed")
          for x_elm in F.split(x, axis=1)]
t_list = F.split(t, axis=1)
for i, (e_t, t_t) in enumerate(zip(e_list, t_list)):
    # l1 and l2 are the recurrent layers defined in the elided code above
    h1 = l1(F.dropout(e_t, 0.5))
    h2 = l2(F.dropout(h1, 0.5))
    y = PF.affine(F.dropout(h2, 0.5), num_words, name="pred")
    t_t = F.reshape(t_t, [batch_size, 1])
    loss += F.mean(F.softmax_cross_entropy(y, t_t))  # summed over timesteps
for epoch in range(max_epoch):
    ....
    for i in range(iter_per_epoch):
        x.d, t.d = get_words(train_data, i, batch_size)
        perp = perplexity(loss.d)
        ....
It appears that you're calculating the exponential of the summed cross-entropy loss. Perplexity, though, is defined as two to the power of the entropy:
Perplexity(M) = 2^H(M)
Perplexity(M) = 2^(-(1/n) * log2(P(w1, w2, ..., wn)))

where log2 is the logarithm base 2.
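As a side note (this check is mine, not from the question), the base only matters if it disagrees with the logarithm used inside the loss: 2 raised to the average cross entropy in bits equals e raised to the average cross entropy in nats, so np.exp is fine as long as what you exponentiate is the average per-token loss in nats.

import numpy as np

# Hypothetical per-token model probabilities, purely for illustration.
token_probs = np.array([0.2, 0.1, 0.05, 0.4])

nats = -np.mean(np.log(token_probs))    # average cross entropy, base e
bits = -np.mean(np.log2(token_probs))   # average cross entropy, base 2

print(np.exp(nats))   # perplexity via e^H (nats)
print(2.0 ** bits)    # perplexity via 2^H (bits) -- same value, about 7.07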
So yes, the perplexity should be based on the average loss per timestep rather than the sum. Summing the per-timestep losses as you do now inflates the cross entropy by a factor of num_steps (35 here), so exponentiating that sum produces the astronomically large values you're seeing.
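Here's a minimal sketch of the fix, reusing the names from your snippet and assuming loss accumulates the per-batch mean cross entropy over num_steps timesteps, as in your graph-building loop:

import numpy as np

def perplexity(summed_loss, num_steps):
    # exponentiate the *average* per-timestep cross entropy (in nats)
    return np.exp(summed_loss / num_steps)

# inside the training loop:
# perp = perplexity(loss.d, num_steps)

Equivalently, you could divide the accumulated loss by num_steps in the graph itself (loss = loss / num_steps after the loop), which also keeps the gradient scale independent of the sequence length.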
More details can be found here.