TF-IDF vector vs a vector of tokens


Some background:

I have data structured as a TFIDF matrix of shape (15637, 31635), and this matrix is the input to the LSTM layer. The longest sentence in my corpus is 305 words, and each TFIDF vector has length 31635 because the corpus vocabulary contains that many distinct words.

Each of the 15637 sentences is a TFIDF vector of shape (31635,).

I am using the TFIDF instead of a pre-trained embedding layer.

No_of_sentences = 15637

BATCH_SIZE = 64

steps_per_epoch = 15637/64 = 244 (with remainder dropped)

vocab_inp_size = 31635  # tokens created by the Keras tokenizer; these are the distinct words in the input corpus

vocab_tar_size = 4  # size of the one-hot encoding of the target value
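
As a quick check of the batch arithmetic above, here is a minimal sketch in plain Python. The lowercase variable names are illustrative only; integer division models the effect of drop_remainder=True.

```python
# Sketch: how many full batches fit in one epoch when the remainder is dropped.
no_of_sentences = 15637
batch_size = 64

# Integer division discards the final partial batch.
steps_per_epoch = no_of_sentences // batch_size
leftover = no_of_sentences % batch_size  # sentences that do not fill a batch

print(steps_per_epoch, leftover)
```

So 21 sentences per epoch are left out of the batches unless the data is reshuffled between epochs.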

The code below first creates tensor slices, then batches them, and finally enumerates the batches, so each iteration yields a tuple of the form (batch_index, (input_tensor, target_tensor)).

import tensorflow as tf

# Slice the arrays row-wise into (tfidf_vector, target) pairs, then shuffle.
dataset = tf.data.Dataset.from_tensor_slices((input_tfidfVector, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)  # this is where batching happens

for batch in enumerate(dataset.take(steps_per_epoch)):
    print(batch)  # prints the tuple: the current batch index (0 here) plus the input and target tensors

(0, (<tf.Tensor: shape=(64, 31635), dtype=float64, numpy=
array([[0.        , 1.74502835, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.74502835, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.74502835, 3.35343652, ..., 0.        , 0.        ,
        0.        ]])>, <tf.Tensor: shape=(64, 1), dtype=int32, numpy=
array([[3],
       [1],
       [2],
       [1],
       [3],
       [1],
       [1],
       [1],
       [1],
       [2],
       [2],
       [2],
       [3],
       [2],
       [2],
       [2],
       [2],
       [2],
       [1],
       [2],
       [1],
       [2],
       [3],
       [2],
       [3],
       [1],
       [1],
       [1],
       [3],
       [1],
       [1],
       [2],
       [2],
       [2],
       [2],
       [2],
       [2],
       [3],
       [3],
       [1],
       [1],
       [3],
       [1],
       [1],
       [1],
       [2],
       [1],
       [1],
       [3],
       [2],
       [1],
       [3],
       [1],
       [3],
       [3],
       [1],
       [2],
       [1],
       [1],
       [1],
       [2],
       [1],
       [1],
       [1]], dtype=int32)>))

Question:

I am not using a pre-trained embedding layer, but a TFIDF vector for each sentence. I am not removing stop words from the input, so TFIDF should downweight any words that are too frequent across the corpus.
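
To make the downweighting concrete, here is a minimal sketch using the textbook definition idf(t) = log(N / df(t)). This is an assumption for illustration: the exact weighting used to build the vectors above is not stated, and library implementations typically add smoothing terms. The toy corpus and the idf helper are hypothetical.

```python
import math

# Toy corpus (hypothetical): "the" occurs in every sentence, "lstm" in only one.
docs = [
    "the model is slow",
    "the lstm is deep",
    "the data is sparse",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def idf(term):
    """Textbook inverse document frequency: log(N / df)."""
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(n_docs / df)

idf_the = idf("the")    # appears in all 3 sentences
idf_lstm = idf("lstm")  # appears in 1 of 3 sentences

print(idf_the, idf_lstm)
```

Under this definition a word that occurs in every sentence gets weight 0, so frequent stop words contribute little even when they are not removed.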

Suppose I instead feed the LSTM just the tokens created by the Keras tokenizer (rather than a TFIDF vector per sentence, as explained above). In theory, is that a good choice? What do you think?

Note: 31635 is the vocabulary size (the number of distinct words across all sentences combined). So each sentence vector has length 31635, but it is mostly sparse (zeros), because the longest sentence in my input is about 300 words.
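
The two input representations being compared can be sketched in plain Python as follows. This is only an illustration under assumptions: the tiny vocabulary, the max_len value, and the helper names are hypothetical, and raw counts stand in for TFIDF weights. It shows that a vocabulary-sized vector has one slot per distinct word, whereas a padded token sequence has one slot per word position.

```python
# Hypothetical toy vocabulary; index 0 is reserved for padding.
word_index = {"dog": 1, "bites": 2, "man": 3}
vocab_size = len(word_index) + 1
max_len = 5  # stands in for the longest sentence (about 300 words in the question)

def to_padded_sequence(sentence):
    """Token ids in word order, right-padded with 0 up to max_len."""
    ids = [word_index[w] for w in sentence.split()]
    return ids + [0] * (max_len - len(ids))

def to_count_vector(sentence):
    """One slot per vocabulary word (counts stand in for TFIDF weights)."""
    vec = [0] * vocab_size
    for w in sentence.split():
        vec[word_index[w]] += 1
    return vec

a, b = "dog bites man", "man bites dog"

seq_a, seq_b = to_padded_sequence(a), to_padded_sequence(b)
vec_a, vec_b = to_count_vector(a), to_count_vector(b)

print(seq_a, seq_b)  # differ: word order is preserved
print(vec_a, vec_b)  # identical: word order is lost
```

The padded sequence has shape (max_len,) per sentence and keeps word order, which is what a recurrent layer steps through; the vocabulary-sized vector has shape (vocab_size,) and is the same for any reordering of the same words.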
