Feed Forward Neural Network Language Model


I am currently trying to develop a feed-forward neural network n-gram language model using TensorFlow 2.0. To be clear, I do not want to implement this with a recurrent neural network; I simply want to use a few Dense layers and a Softmax layer. This is the reference I have used, which also outlines the architecture of the model: https://www.researchgate.net/publication/301875194_Authorship_Attribution_Using_a_Neural_Network_Language_Model

However, when I tried to do this, I kept getting an error. My model is given below:

tf.keras.optimizers.Adam(learning_rate=0.01)
model = tf.keras.Sequential([
                             tf.keras.layers.Embedding(total_words, 300, weights = [embeddings_matrix], input_length=inputs.shape[1], trainable = False),
                             tf.keras.layers.Dense(100, activation = 'relu'),
                             tf.keras.layers.Dense(total_words, activation = 'softmax')
])

model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

When this code is run, I get the following error:

ValueError: Shapes (None, 7493) and (None, 116, 7493) are incompatible

Can someone please tell me how to resolve this? I am slightly confused.

1 Answer

Answered by Tim

In the paper you linked, the group aims to do word-to-word translation while considering the context of the source word. Therefore the input to the network is a stack of words, i.e. the context. Your minibatch of word stacks should have shape batch x input_length and contain (integer) indices, as the Embedding layer is basically a lookup table (e.g. it returns the fifth row of its weights for the input '5'). This is a bit different from the paper, where the input seems to be one-hot encoded vectors.

Since the embedding layer returns a matrix row for each integer in the input, here it will output a tensor of size (batch, input_length, 300) with 300 being your embedding size.
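As a quick sanity check, here is a minimal sketch of those shapes with made-up values (the 7493 and 116 come from your error message, the rest is arbitrary):

import numpy as np
import tensorflow as tf

batch_size, input_length, vocab_size, embedding_size = 4, 116, 7493, 300

# Integer word indices, shape (batch, input_length)
word_ids = np.random.randint(0, vocab_size, size=(batch_size, input_length))

embedding = tf.keras.layers.Embedding(vocab_size, embedding_size)
print(embedding(word_ids).shape)  # (4, 116, 300) -> (batch, input_length, embedding_size)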

Your second layer (the ReLU-activated Dense) will now transform this to a tensor of size (batch, input_length, 100), leaving the input_length dimension intact. Dense layers in TF-Keras transform over the last axis of the input, so in your first Dense a bunch of sub-tensors of size 1 x 1 x 300 would be transformed to size 1 x 1 x 100 and then concatenated along dimensions 0 and 1. The same thing would happen in your second Dense.
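You can see this with a tiny example (random data, same made-up sizes as above):

import tensorflow as tf

dense = tf.keras.layers.Dense(100, activation='relu')
x = tf.random.normal((4, 116, 300))   # (batch, input_length, embedding_size)
print(dense(x).shape)                 # (4, 116, 100): only the last axis is transformed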

Since you do not want to predict all words in your context, you have to 'get rid' of the input_length dimension. In the paper, the embeddings are stacked to produce a tensor of size batch x (input_length*embedding_size), which is then fed to the Dense layers. They describe this in the last paragraph on page 1.

A Flatten() layer between Embedding and Dense should do the trick in your implementation, as it will squash all dimensions except the batch dimension. The first Dense will then get a batch x (input_length*300) tensor, the second a batch x 100 tensor, and the model will output a batch x total_words tensor, as sketched below.
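For concreteness, a sketch of how your model could look with the Flatten layer added (reusing your total_words, embeddings_matrix and inputs exactly as in your snippet):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(total_words, 300, weights=[embeddings_matrix],
                              input_length=inputs.shape[1], trainable=False),
    tf.keras.layers.Flatten(),                                 # batch x (input_length * 300)
    tf.keras.layers.Dense(100, activation='relu'),             # batch x 100
    tf.keras.layers.Dense(total_words, activation='softmax')   # batch x total_words
])

model.compile(loss='categorical_crossentropy',
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              metrics=['accuracy'])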

In your implementation, I would then guess the targets should contain a one-hot encoding of a word for each batch entry. This is what they use in the paper, and it is where categorical cross-entropy makes sense.
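If your targets are currently integer word indices rather than one-hot vectors, two common Keras options (a general pattern, not something specific to the paper) are:

import numpy as np
import tensorflow as tf

total_words = 7493                                       # vocabulary size from your error message
labels = np.random.randint(0, total_words, size=(32,))   # hypothetical integer targets

# Option 1: one-hot encode the targets to match categorical_crossentropy
labels_onehot = tf.keras.utils.to_categorical(labels, num_classes=total_words)  # shape (32, 7493)

# Option 2: keep the integer targets and use the sparse variant of the loss instead
# model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])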

BTW, setting weights in the Embedding layer is deprecated; you should use embeddings_initializer=tf.keras.initializers.Constant(embeddings_matrix) instead.
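That replacement would look roughly like this (again with your variable names):

import tensorflow as tf

embedding_layer = tf.keras.layers.Embedding(
    total_words, 300,
    embeddings_initializer=tf.keras.initializers.Constant(embeddings_matrix),
    input_length=inputs.shape[1],
    trainable=False)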

EDIT: Further clarification on sizes; this did not fit in a comment.