How can I exclude words from frequency word analysis in a list of articles in python?


I have a dataframe df with a column "Content" that contains a list of articles extracted from the internet. I already have code that builds a dataframe with the expected output (two columns, one for the word and the other for its frequency). However, I would like to exclude some words (connectors, for instance) from the analysis. My code is below; what should I add to it?

Is it possible to use get_stop_words('fr') to do this more efficiently (since my articles are in French)?

Source Code

    import csv
    from collections import Counter
    from collections import defaultdict

    import pandas as pd


    df = pd.read_excel('C:/.../df_clean.xlsx', 
                                sheet_name='Articles Scraping')
    df = df[df['Content'].notnull()]
    d1 = dict()

    for line in df[df.columns[6]]:
        words = line.split()
        # print(words)
        for word in words:
            if word in d1:
                d1[word] += 1
            else:
                d1[word] = 1

    sort_words = sorted(d1.items(), key=lambda x: x[1], reverse=True)

There is 1 answer

DarknessPlusPlus On

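There are a few ways you can achieve this. If the column holds one word per row, you can filter with the isin() method. The list can be passed directly, so no list comprehension is needed:

    import pandas as pd

    data = {'test': ['x', 'NaN', 'y', 'z', 'gamma']}
    df = pd.DataFrame(data)

    words = ['x', 'y', 'NaN']

    # Keep only the rows whose value is not in the exclusion list
    df = df[~df.test.isin(words)]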

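Or you can negate str.contains() with the words joined into a single pattern. Note that this is a regular-expression substring match, so it drops every row that contains one of the words anywhere in its text, not only exact matches:

    df = df[~df.test.str.contains('|'.join(words))]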
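If substring matching is too aggressive, a whole-word pattern may be closer to what you want. The sketch below builds one with the standard re module, using the same words list as above; the samples list is made up purely for illustration. Since str.contains() treats its argument as a regular expression by default, the resulting pattern could be passed to it in place of the plain join.

```python
import re

words = ['x', 'y', 'NaN']

# Escape each word so regex metacharacters are taken literally,
# and wrap the alternation in \b so only whole words match.
pattern = r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b'

# Hypothetical sample values, for illustration only
samples = ['x', 'xylophone', 'gamma', 'NaN', 'say y']

# Values that contain one of the words as a whole word
matches = [s for s in samples if re.search(pattern, s)]
print(matches)
```

With the plain '|'.join(words) pattern, 'xylophone' would also match because it contains the letter x.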

If you want to utilize the stop words package for French, you can also do that, but you must preprocess all of your texts before you start doing any frequency analysis.

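    # Assumption: `stopwords` here is the stopwordsiso package
    import stopwordsiso as stopwords

    french_stopwords = set(stopwords.stopwords("fr"))

    # Extend the standard list with your own words to exclude
    STOPWORDS = list(french_stopwords)
    STOPWORDS.extend(['add', 'new', 'words', 'here'])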

I think extend() will help you a lot, since it lets you add your own words to the standard list.
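To tie this back to the frequency count in the question, here is a minimal sketch of filtering stop words while counting. It is not the only way to do it. The function name word_frequencies, the tiny FRENCH_STOPWORDS set, and the sample articles are all made up for illustration; in practice you would pass in a full stop-word list such as the STOPWORDS built above.

```python
import re
from collections import Counter

# Illustrative subset only; a real list would come from a stop-word package.
FRENCH_STOPWORDS = {"le", "la", "les", "de", "des", "du", "et",
                    "un", "une", "l", "d", "est", "dans"}


def word_frequencies(texts, stopwords):
    """Count words across all texts, skipping the given stop words."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep runs of letters (accented letters included).
        # Apostrophes and punctuation act as separators, so "l'été"
        # becomes the two tokens "l" and "été".
        tokens = re.findall(r"[^\W\d_]+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts


# Hypothetical sample data standing in for the "Content" column
articles = [
    "Le chat est dans la maison et le chien est dans le jardin.",
    "La maison de l'été",
]

freq = word_frequencies(articles, FRENCH_STOPWORDS)
print(freq.most_common(3))
```

With the dataframe from the question, you could pass df['Content'] as texts and set(STOPWORDS) as stopwords, then build the two-column result with pd.DataFrame(freq.most_common(), columns=['word', 'frequency']), where the column names are just placeholders.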