Long text topic modelling differences


I have some very long documents. They have overall topics that are fairly standard, but each document will emphasise the topics differently, and within those topics it will have different subtopics.

I would like to determine:

1. The importance/probability of each topic within each document (e.g. document 1 puts more emphasis on topic 3 than document 2 does), and
2. The subtopics within each topic and their probabilities.

I have mostly seen BERTopic and Top2Vec used for short texts like tweets.

Would they be an appropriate strategy for very long documents, or is there a better one?


1 Answer

gojomo

You have to try them (and other classic methods like LDA) on your documents, against your goals, to evaluate their applicability. No external authority, having only a vague idea of what's available and important to your project, can give an a priori assessment of what will work or what would be practical/optimal.

Once you've tried various techniques, observed where they do and don't work, and have a better idea of what you'd hoped for but found lacking, you'll be able to ask more detailed questions that can generate better insight.

Most topic-modeling options will report a relative score for each topic, per document. So yes, you'll have a sense of which documents are relatively more associated with certain topics.
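As a minimal sketch of what that per-document score looks like with classic LDA (gensim here; the two toy documents and `num_topics=2` are placeholder assumptions):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Placeholder corpus: each document is a list of tokens.
docs = [
    "solar panels wind turbines renewable energy grid storage".split(),
    "interest rates inflation central bank monetary policy".split(),
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)

# Each document gets a probability for every topic; comparing these
# across documents shows which document emphasises which topic more.
for i, bow in enumerate(corpus):
    print(f"doc {i}:", lda.get_document_topics(bow, minimum_probability=0.0))
```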

Many methods don't automatically create hierarchical "sub-topics" under higher-level topics, so if that's a requirement, it may take extra effort/steps.
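If you do end up with BERTopic, it ships a hierarchical-topics utility that can serve as one of those extra steps. A rough sketch, assuming a hypothetical `load_your_documents()` that returns your corpus (method names per the BERTopic documentation; behaviour may differ across versions):

```python
from bertopic import BERTopic

# Hypothetical loader for your corpus; BERTopic typically needs at least
# a few hundred documents (or subdocuments) to form stable clusters.
docs = load_your_documents()

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Merge the flat topics into a hierarchy and print it as a text tree,
# which gives a rough high-level-topic / sub-topic structure.
hierarchical_topics = topic_model.hierarchical_topics(docs)
print(topic_model.get_topic_tree(hierarchical_topics))
```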

If your documents are especially long, you may find it useful to split them into subdocuments, so that the topic analysis is more sensitive to the full diversity of each document and can point to the specific places where topics appear. Such splits would ideally match the document's own sections/chapters, but even a purely mechanical split may help you detect/characterize finer shifts in topic than an analysis of each full large document would reveal.
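For example, a purely mechanical split might look something like this sketch (the blank-line paragraph delimiter and 300-word window are assumptions to tune for your corpus):

```python
# Split each long document on blank-line paragraph breaks, then into
# fixed-size word windows, keeping a map from chunk back to source doc.
def split_document(text, max_words=300):
    chunks = []
    for para in text.split("\n\n"):
        words = para.split()
        for start in range(0, len(words), max_words):
            chunk = " ".join(words[start:start + max_words])
            if chunk:
                chunks.append(chunk)
    return chunks

# Placeholder corpus; substitute your real long documents.
long_documents = ["First section text.\n\nSecond section text."]

subdocs, origins = [], []
for doc_id, text in enumerate(long_documents):
    for chunk in split_document(text):
        subdocs.append(chunk)
        origins.append(doc_id)

# subdocs can now be fed to any topic model; origins lets you aggregate
# the per-chunk topic scores back up to the original documents.
```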