I am currently trying to develop a feed-forward neural network n-gram language model using TensorFlow 2.0. To be clear, I do not want this implemented via a recurrent neural network; I simply want to use a few Dense layers and a Softmax layer. This is the reference I have used, which also outlines the model's architecture: https://www.researchgate.net/publication/301875194_Authorship_Attribution_Using_a_Neural_Network_Language_Model
However, when I try to do this, I keep getting an error. Given below is my model:
tf.keras.optimizers.Adam(learning_rate=0.01)
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(total_words, 300, weights=[embeddings_matrix], input_length=inputs.shape[1], trainable=False),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(total_words, activation='softmax')
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
When this code is run, the error I get is as follows:
ValueError: Shapes (None, 7493) and (None, 116, 7493) are incompatible
Can someone please tell me how to resolve this? I am slightly confused.
In the paper you linked, the group aims to do word-to-word translation while considering the context of the source word. Therefore the input to the network is a stack of words - the context. Your minibatch of word-stacks should have the dimension
batch x input_length
and contain (integer) indices, as the Embedding layer is basically a lookup table (e.g. it returns the fifth row of its weight matrix for the input 5). This is a bit different from the paper, where the input seems to be one-hot encoded vectors. Since the embedding layer returns a matrix row for each integer in the input, here it will output a tensor of size
(batch, input_length, 300)
with 300 being your embedding size. Your second layer (the relu-activated Dense) will now transform this into a tensor of size
(batch, input_length, 100)
, leaving the input_length dimension intact. Dense layers in TF-Keras transform over the last axis of their input, so in your first Dense a bunch of sub-tensors of size 1 x 1 x 300 would each be transformed to a size of 1 x 1 x 100 and then concatenated along dimensions 0 and 1. The same thing would happen in your second Dense. Since you do not want to predict all the words in your context, you have to 'get rid' of the input_length dimension. In the paper, the embeddings are stacked to produce a tensor of size batch x (input_length*embedding_size), which is then fed to the Dense layers. They describe this in the last paragraph on page 1.
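As a quick standalone check of the shape behaviour described above (with made-up sizes, using the 116 from your error message), you can see that Dense only transforms the last axis:

```python
import tensorflow as tf

# Toy sizes standing in for your real ones (assumed values for illustration).
batch, input_length, embedding_size = 4, 116, 300

# Simulate the Embedding layer's output: (batch, input_length, embedding_size).
x = tf.random.normal((batch, input_length, embedding_size))

# Dense maps only the last axis, 300 -> 100; the first two axes are untouched.
y = tf.keras.layers.Dense(100, activation='relu')(x)
print(y.shape)  # (4, 116, 100)
```

This is exactly why your softmax layer ends up producing a (None, 116, 7493) tensor instead of the (None, 7493) tensor your labels have.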
A Flatten() layer between the Embedding and the first Dense should do the trick in your implementation, as it will squash all dimensions except the batch dimension. The first Dense will then receive a batch x (input_length*300) tensor, the second a batch x 100 tensor, and the model will output a batch x total_words tensor. In your implementation, I would guess each batch entry of this output should contain a one-hot encoding of a word. This is what they use in the paper, and it is where categorical cross-entropy makes sense.
BTW, setting weights in the Embedding layer is deprecated; you should use
embeddings_initializer=tf.keras.initializers.Constant(embeddings_matrix)
instead.
EDIT: Further clarification on sizes; this did not fit in a comment.