How to Handle Out-of-Period Terms in Dynamic Topic Modeling (DTM) using Gensim?

I'm working on a Dynamic Topic Modeling (DTM) project using Python's Gensim library to analyze a dataset of 20,000 online Chinese reviews of tourist attractions. In the results, certain terms appear in time periods where they should not occur, and I suspect this is caused by the model's smoothing across time slices. For example, "epidemic" shows up in topics for periods before the outbreak of the COVID-19 pandemic.

Here is my code:

import ast

import pandas as pd
from gensim.corpora import Dictionary
from gensim.models import ldaseqmodel
from gensim.utils import simple_preprocess

# pos_df, pos_below, pos_above and pos_topic_num are defined earlier (not shown)
pos_df['date'] = pd.to_datetime(pos_df['date'], unit='ms')
pos_df.sort_values(by='date', ascending=True, inplace=True)
pos_df.set_index("date", inplace=True)

# Define time slices: number of documents per period, in chronological order.
# len() counts every row; .count()[0] would skip NaNs in the first column.
pos_time_slice = [len(pos_df[:'2017-07-08']),
                  len(pos_df['2017-07-09':'2020-01-03']),
                  len(pos_df['2020-01-04':'2022-12-07']),
                  len(pos_df['2022-12-08':])]

# Function to preprocess data and create dictionary
def get_dic(data_df, col, no_below, no_above):
    # Each cell holds a stringified token list; literal_eval parses it without executing code
    texts = data_df[col].apply(lambda x: ' '.join(ast.literal_eval(x)))
    texts = [simple_preprocess(text) for text in texts]
    dictionary = Dictionary(texts)
    # Large corpora use the caller's thresholds; small ones fall back to fixed values
    if len(texts) > 10000:
        dictionary.filter_extremes(no_below=no_below, no_above=no_above)
    else:
        dictionary.filter_extremes(no_below=10, no_above=0.4)
    corpus = [dictionary.doc2bow(text) for text in texts]
    return texts, corpus, dictionary

# Get dictionary and corpus
pos_texts, pos_corpus, pos_dictionary = get_dic(pos_df, "segmented comments", pos_below, pos_above)

# Build DTM model
pos_DTM = ldaseqmodel.LdaSeqModel(corpus=pos_corpus, id2word=pos_dictionary,
                                  time_slice=pos_time_slice, num_topics=pos_topic_num,
                                  em_max_iter=500)
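
To rule out a data or slicing problem before blaming the model, the raw per-slice counts of a term can be checked directly. The following is a minimal standalone sketch of such a check; term_doc_freq_by_slice is a hypothetical helper (not part of Gensim), and the toy data only stands in for pos_texts and pos_time_slice.

```python
from itertools import accumulate


def term_doc_freq_by_slice(texts, time_slice, term):
    """Count, for each time slice, how many documents contain `term`.

    `texts` is a list of token lists in chronological order and
    `time_slice` holds the number of documents in each slice.
    """
    # The DTM time slices are expected to cover the corpus exactly.
    if sum(time_slice) != len(texts):
        raise ValueError("time_slice must sum to the number of documents")
    ends = list(accumulate(time_slice))
    starts = [0] + ends[:-1]
    return [sum(term in doc for doc in texts[s:e]) for s, e in zip(starts, ends)]


# Toy data (assumption): four tokenized reviews split into three slices.
toy_texts = [["景色", "很美"], ["门票", "便宜"], ["疫情", "期间", "人少"], ["疫情", "结束"]]
toy_slices = [2, 1, 1]
counts = term_doc_freq_by_slice(toy_texts, toy_slices, "疫情")
print(counts)  # [0, 1, 1]
```

If the count for an early slice is zero in the raw data but the term still ranks in that slice's topics, the term is being carried over by the model rather than by the input.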

I've tried adjusting parameters such as the number of iterations (em_max_iter), but this significantly slows down the computation. I would like to adjust the model's smoothing so that it better captures how topics evolve over time. However, I couldn't find any variable or option in Gensim that is clearly meant for this.
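
The constructor arguments that look possibly related are chain_variance (default 0.005) and obs_variance (default 0.5); the Gensim documentation describes chain_variance as the Gaussian parameter that dictates how the topic-word values evolve over time. Whether changing it removes the out-of-period terms is untested. Below is a sketch of a small sweep, assuming the corpus, dictionary and time slices from the code above; fit_dtm and CANDIDATE_VARIANCES are hypothetical names, and the values other than the default are arbitrary.

```python
# Hypothetical sweep values; 0.005 is the Gensim default for chain_variance.
CANDIDATE_VARIANCES = [0.005, 0.05, 0.5]


def fit_dtm(corpus, dictionary, time_slice, num_topics, chain_variance):
    """Fit one LdaSeqModel with the given chain_variance.

    A larger value should couple adjacent time slices more loosely.
    """
    # Imported lazily so this sketch can be loaded without Gensim installed.
    from gensim.models.ldaseqmodel import LdaSeqModel

    return LdaSeqModel(
        corpus=corpus,
        id2word=dictionary,
        time_slice=time_slice,
        num_topics=num_topics,
        chain_variance=chain_variance,
        em_max_iter=20,  # Gensim's default cap, far cheaper per run than 500
    )


# Usage (slow: each call is a full DTM fit):
# models = {v: fit_dtm(pos_corpus, pos_dictionary, pos_time_slice, pos_topic_num, v)
#           for v in CANDIDATE_VARIANCES}
```

Comparing the top terms per time slice across the fitted models would show whether the setting has any effect on terms such as "epidemic".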

I'm seeking insights on:

  1. Possible reasons for the occurrence of out-of-period terms in DTM results.
  2. Strategies to handle or mitigate this issue, such as adjusting smoothing parameters or refining preprocessing techniques.
  3. Suggestions for optimizing the efficiency of Gensim's DTM implementation.

Any guidance, advice, or references would be greatly appreciated. Thank you!
