Some background:
I have data structured as a TFIDF matrix of shape (15637, 31635), and this matrix is the input to the LSTM layer. The longest sentence in my corpus is 305 words, and each TFIDF vector has length 31635 because that is the number of distinct words in the corpus vocabulary. Each of the 15637 sentences is therefore a TFIDF vector of shape (31635,).
I am using TFIDF instead of a pre-trained embedding layer.
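For reference, this is roughly how such a matrix can be produced with the Keras tokenizer (just a sketch; sentences is an illustrative name for my list of raw sentence strings, which is not shown here):

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)  # learns the vocabulary (31635 distinct words here)

# Each row is one sentence, each column one vocabulary word, weighted by TFIDF.
# Note that Keras reserves index 0, so the matrix has len(word_index) + 1 columns.
input_tfidfVector = tokenizer.texts_to_matrix(sentences, mode='tfidf')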
No_of_sentences = 15637
BATCH_SIZE = 64
steps_per_epoch = 15637 // 64 = 244  # remainder dropped
vocab_inp_size = 31635  # tokens created by the Keras tokenizer; these are the distinct words in the input corpus
vocab_tar_size = 4  # number of target classes (the target is one-hot encoded)
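These values can be derived directly from the data and the tokenizer (a sketch, assuming the tokenizer and input_tfidfVector from the snippet above; BUFFER_SIZE is the shuffle-buffer size used in the next snippet, which I simply set to the dataset size):

No_of_sentences = input_tfidfVector.shape[0]      # 15637
BATCH_SIZE = 64
steps_per_epoch = No_of_sentences // BATCH_SIZE   # 244, remainder dropped
vocab_inp_size = len(tokenizer.word_index)        # 31635 distinct words in the corpus
vocab_tar_size = 4                                # number of target classes
BUFFER_SIZE = No_of_sentences                     # shuffle buffer for tf.data below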
The code below first creates tensor slices, then batches them, and finally enumerates each batch to give a tuple of the form (batch, (input_tensor, target_tensor)):
dataset = tf.data.Dataset.from_tensor_slices((input_tfidfVector, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)  # this is where batching happens
for batch in enumerate(dataset.take(steps_per_epoch)):
    print(batch)  # prints the tuple: the current batch index (batch 0) together with the input and target tensors
(0, (<tf.Tensor: shape=(64, 31635), dtype=float64, numpy=
array([[0. , 1.74502835, 0. , ..., 0. , 0. ,
0. ],
[0. , 0. , 0. , ..., 0. , 0. ,
0. ],
[0. , 0. , 0. , ..., 0. , 0. ,
0. ],
...,
[0. , 0. , 0. , ..., 0. , 0. ,
0. ],
[0. , 1.74502835, 0. , ..., 0. , 0. ,
0. ],
[0. , 1.74502835, 3.35343652, ..., 0. , 0. ,
0. ]])>, <tf.Tensor: shape=(64, 1), dtype=int32, numpy=
array([[3],
[1],
[2],
[1],
[3],
[1],
[1],
[1],
[1],
[2],
[2],
[2],
[3],
[2],
[2],
[2],
[2],
[2],
[1],
[2],
[1],
[2],
[3],
[2],
[3],
[1],
[1],
[1],
[3],
[1],
[1],
[2],
[2],
[2],
[2],
[2],
[2],
[3],
[3],
[1],
[1],
[3],
[1],
[1],
[1],
[2],
[1],
[1],
[3],
[2],
[1],
[3],
[1],
[3],
[3],
[1],
[2],
[1],
[1],
[1],
[2],
[1],
[1],
[1]], dtype=int32)>))
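The same loop can also unpack the tuple directly, which makes the shapes easier to check (just a sketch):

for (batch, (input_tensor, target_tensor)) in enumerate(dataset.take(steps_per_epoch)):
    print(batch, input_tensor.shape, target_tensor.shape)  # e.g. 0 (64, 31635) (64, 1)
    break  # only checking the first batch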
Question:
I am not using a pre-trained embedding layer, but a TFIDF vector for each sentence. I am not removing stop words from the input, so TFIDF will downweight any words that are too frequent across the corpus.
Now suppose I instead just use the integer tokens created by the Keras tokenizer (rather than a TFIDF vector per sentence as explained above; see the sketch after the note below). In theory, would that be a good choice? What do you think?
Note: 31635 is the vocabulary size of the corpus (the number of distinct words across all sentences). So each sentence is represented by a vector of length 31635, which is mostly zeros, because the longest sentence in my input is only about 300 words.
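For concreteness, this is the alternative I am asking about, as a rough sketch (the embedding size of 128 and the exact layer stack are placeholders; sentences again stands for my raw sentence strings):

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 305  # longest sentence in the corpus
sequences = tokenizer.texts_to_sequences(sentences)                # variable-length lists of token ids
padded = pad_sequences(sequences, maxlen=MAX_LEN, padding='post')  # shape (15637, 305)

model = tf.keras.Sequential([
    # +1 because token index 0 is reserved for padding by the Keras tokenizer
    tf.keras.layers.Embedding(input_dim=vocab_inp_size + 1, output_dim=128, mask_zero=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(vocab_tar_size, activation='softmax'),
])

The difference is that each row of padded keeps the word order of its sentence, while each TFIDF row is an unordered bag-of-words weighting.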