I want to create embeddings and then run a logistic regression on them. The output data look like this:
```
0       [[-0.00034277988, 0.0013405628, -1.998733e-05,...
1       [[0.00075779966, -0.00025276924, 0.0009634475,...
2       [[-0.0032675266, -0.0015163509, 0.0051634307, ...
3       [[0.0006605284, -0.0040500723, 0.0041460698, -...
                              ...
4774    [[0.0005923094, -0.00194318, 0.0015639212, 0.0...
4775    [[-0.002365636, 0.0023984204, -0.0004855222, -...
4776    [[-0.0028686645, 0.0019738101, 0.0037081288, 0...
4777    [[0.0024941873, -0.0019521558, -0.0019918315, ...
Name: Tweet, Length: 4779, dtype: object
```
But in order to run the regression I need the values to be numeric, with each number in its own column, i.e. a matrix of shape [4778 rows x 768 columns].
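For reference, if each row already held a single fixed-length vector, the expansion into columns would be straightforward (a minimal sketch; `df_emb` is the Series shown above):

```python
import numpy as np
import pandas as pd

# Stack the per-row vectors into an (n_rows, n_dims) array, then let the
# resulting DataFrame have one numeric column per embedding dimension.
X = pd.DataFrame(np.vstack(df_emb.values), index=df_emb.index)
```

The catch is that each of my rows holds a *list* of per-token vectors rather than a single vector, which is where I'm stuck.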
My FastText code is below. I don't know whether it's better to change the FastText code itself or to transform the embeddings after they have been generated.
```python
import pandas as pd
from nltk.tokenize import word_tokenize
from gensim.models import FastText

df = pd.read_csv('OGTDv1.csv')

# Tokenize each tweet separately; iterating over the column avoids looping
# over the characters of one big joined string.
sentences = [word_tokenize(str(rev).lower()) for rev in df['Tweet']]
model = FastText(sentences, vector_size=128, window=5, min_count=3, workers=4, epochs=10, seed=42)
model.save('tokped_review.ft')
ftext = model.wv  # handle to the trained word vectors (not used below)

def get_sentence_embeddings(sentence, model):
    # Returns one vector per in-vocabulary token, which is why every row
    # of the output above is a nested list of arrays.
    tokens = word_tokenize(sentence.lower())
    return [model.wv[token] for token in tokens if token in model.wv]

df_emb = df['Tweet'].apply(lambda x: get_sentence_embeddings(x, model))
print(df_emb)
df_emb.to_pickle('ToxicityFastText_Embeddings.pkl')
```
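In case a concrete target helps: what I'm considering is mean-pooling the per-token vectors into one fixed-length vector per tweet, then expanding that into columns for scikit-learn. A minimal sketch, assuming mean pooling and a placeholder label column named `label` (my real column name may differ):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def pool(token_vectors, dim):
    # Average the token vectors into one sentence vector; fall back to a
    # zero vector when no token made it into the FastText vocabulary.
    return np.mean(token_vectors, axis=0) if token_vectors else np.zeros(dim, dtype=np.float32)

dim = model.vector_size  # 128 with the settings above
# One row per tweet, one numeric column per embedding dimension.
X = pd.DataFrame(np.vstack([pool(v, dim) for v in df_emb]), index=df_emb.index)

y = df['label']  # placeholder: substitute the actual label column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```

Would it be cleaner to do this pooling inside `get_sentence_embeddings`, or only after loading the pickle?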