I'm trying to build a model that determines how strongly a string of text relates to a predefined topic. I have tried several methods (mainly LDA with seed words and Naive Bayes), but none of them gives the desired results.

I have a list of two topics, "inside" and "outside", with several words related to each topic:

Inside      Outside
Production  Clients
Marketing   Suppliers
Finance     Banks
etc.        etc.

The text I want to analyze is stored in columns, for example as a string like: banks_production_clients

Moreover, I have about 1115 documents, each related to several of these columns (about 200 per document).

I want my model to recognize that the example string above contains two words that belong to the topic "outside" and one that belongs to the topic "inside". That makes it roughly 0.67 related to outside and 0.33 related to inside. In the end, I want to see how much each document (with its roughly 200 columns) relates to either topic.
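To make the intended calculation concrete, here is a minimal sketch of the scoring I have in mind. It is plain word counting, not LDA or Naive Bayes, and it rests on assumptions: the tokens are separated by underscores, matching is case-insensitive, and the seed lists contain only the three example words per topic. The names `SEED_WORDS`, `topic_counts`, `topic_shares` and `document_shares` are made up for illustration.

```python
from collections import Counter

# Seed words per topic, taken from the list above (lowercased).
# Assumption: the real lists are longer than these three examples.
SEED_WORDS = {
    "inside": {"production", "marketing", "finance"},
    "outside": {"clients", "suppliers", "banks"},
}


def topic_counts(text):
    """Count how many underscore-separated tokens belong to each topic."""
    counts = Counter({topic: 0 for topic in SEED_WORDS})
    for token in text.lower().split("_"):
        for topic, words in SEED_WORDS.items():
            if token in words:
                counts[topic] += 1
    return counts


def topic_shares(counts):
    """Turn raw counts into shares that sum to 1 (all zeros if nothing matched)."""
    total = sum(counts.values())
    if total == 0:
        return {topic: 0.0 for topic in counts}
    return {topic: n / total for topic, n in counts.items()}


def document_shares(texts):
    """Pool the counts of all strings in one document, then normalize."""
    totals = Counter({topic: 0 for topic in SEED_WORDS})
    for text in texts:
        totals.update(topic_counts(text))
    return topic_shares(totals)


shares = topic_shares(topic_counts("banks_production_clients"))
print({topic: round(share, 2) for topic, share in shares.items()})
# prints {'inside': 0.33, 'outside': 0.67}
```

Note that `document_shares` pools the counts over all strings in a document, so strings with more matching words weigh more. Averaging the per-string shares instead would be an equally plausible choice, and I am not sure which one is more appropriate.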

The word frequencies differ greatly, so when I ran LDA, the most frequent words were grouped together, because they also co-occur far more often.
