Is it possible to use pre-trained GloVe vectors in a TensorFlow word-rnn LSTM generative model, and if so, is there any guidance on how to achieve this?
I am referencing this from here, and I understand (I think) that I am supposed to put the vectors into the embeddings at lines 35-37 of model.py. From the code, I see that the author is not using any pre-trained vectors, only the words from the input text.
I have seen other answers like this one, but as I am new to TensorFlow and Python, I do not fully understand how to apply them to the code.
GloVe generates two files, namely:
- a vocabulary file, with the count of all word occurrences
- a vector file, e.g. the entry for the word "also": `also -0.5432 -0.3210 0.1234 ... (n_dimensions values)`
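For reference, this is roughly how I understand the vector file can be read into a Python dict (a minimal sketch; the file name `glove.6B.100d.txt` is just an example of a standard pre-trained file):

```python
import numpy as np

# Minimal sketch: read a GloVe vector file into a {word: vector} dict.
# The file name is an example (standard pre-trained GloVe, 100 dimensions).
embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        embeddings_index[word] = np.asarray(coefs.split(), dtype="float32")

print("Loaded %d word vectors." % len(embeddings_index))
```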
Also, do I have to generate the GloVe vectors and train the LSTM model on the same corpus, or can they be separate? E.g. GloVe trained on 100k words, text_to_train on 50k words.
Thank you for the assistance!
Embeddings are word encodings. You load a pre-trained GloVe encoding "dictionary" with 400,000 entries, where each token (entry) is encoded as a 1-D vector of dimension 50 for GloVe 50, 100 for GloVe 100, etc.
Your input dataset's vocabulary (N tokens) goes through this encoding: each token is looked up in the GloVe dictionary and its vector is stored as a row of the embedding matrix, of shape (N, 50), (N, 100), etc.
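A minimal sketch of this step, assuming `embeddings_index` is the loaded GloVe dictionary and `word_index` maps each word of your corpus to an integer id (both names are illustrative):

```python
import numpy as np

embedding_dim = 100                 # must match the GloVe file (GloVe 100 here)
num_tokens = len(word_index) + 1    # +1 so index 0 can serve as padding

# Row i holds the GloVe vector for the word with id i; words missing
# from GloVe keep an all-zeros row.
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector
```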
You then build a Keras Embedding layer from this embedding matrix; its output is fed into the LSTM.
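A minimal sketch of that wiring (the layer sizes are illustrative, not taken from your model):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # Embedding layer initialized with the pre-trained GloVe matrix;
    # trainable=False keeps the vectors frozen during training.
    layers.Embedding(
        num_tokens,
        embedding_dim,
        embeddings_initializer=keras.initializers.Constant(embedding_matrix),
        trainable=False,
    ),
    layers.LSTM(128),
    # For a generative word model: predict the next word over the vocabulary.
    layers.Dense(num_tokens, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```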
https://keras.io/examples/nlp/pretrained_word_embeddings/