How to compare CSV files


I want to compare two CSV files and output the racist words they have in common. I don't want the whole solution; my problem is that I don't understand how to do the comparison. I don't know where to start and hope that you can help me.

Here is my code (I've renamed the file to example; it's a project):

import csv
import re
import urllib.request

import pandas as pd
from bs4 import BeautifulSoup
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

url = 'https://'
ourUrl = urllib.request.urlopen(url)
soup = BeautifulSoup(ourUrl, 'html.parser')
print(soup.prettify())

# Collect the first <div> of the target article
toggle = []
for i in soup.find_all('article', {'id': 'post-113572'}):
    per_toggle = i.find('div')
    print(per_toggle)
    toggle.append(per_toggle)

# Strip the HTML tags from each collected element
pattern = re.compile('<.*?>')
New_toggle = []
for each in toggle:
    new_each = str(each).replace('<br/>', ' ')
    result = re.sub(pattern, '', new_each)
    print(result)
    New_toggle.append(result)

# Write the cleaned text once, after the loops have finished
df = pd.DataFrame(New_toggle)
df.to_csv('example.csv')
df = pd.read_csv('example.csv')

# Tokenize and POS-tag every field of the scraped text
New_data = []
with open('example.csv', 'r', encoding='utf-8') as f:
    reader = csv.reader(f, delimiter=",", quotechar='|')
    for line in reader:
        for field in line:
            tokens = word_tokenize(field, language='german')
            posData = pos_tag(tokens)
            print(posData)
            New_data.append(posData)

# Write the result once, after all lines have been read
df1 = pd.DataFrame(New_data)
df1.to_csv('ExTokenization.csv', index=False, encoding="utf-8")

# Append a lowercased copy of the tokenized text to example.csv
with open('ExTokenization.csv', encoding='utf-8') as infile, open('example.csv', "a", encoding='utf-8') as outfile:
    for line in infile:
        outfile.write("\n" + line.lower())

# Tokenize and POS-tag every field of the racist word list
New_data = []
with open('rassismus.csv', 'r', encoding='utf-8') as f:
    reader = csv.reader(f, delimiter=",", quotechar='|')
    for line in reader:
        for field in line:
            tokens = word_tokenize(field, language='german')
            posData = pos_tag(tokens)
            print(posData)
            New_data.append(posData)

# Write the result once, after all lines have been read
df2 = pd.DataFrame(New_data)
df2.to_csv('RassismusTokenization.csv', index=False, encoding='utf-8')

This is my whole code. As you can see, I have RassismusTokenization.csv (a CSV with a list of racist words) and ExTokenization.csv (with texts from a website).

I hope that someone can give me a hint. I want to compare the words from the racist list with the text and get a result like: in the text are: X racist words. The racist words are: .....

Thanks everyone for hints!!!

1 answer

KingOtto

One idea: find all words that are present in both sets of words (the text, assuming it is a set of words, and your set of racist words):

[word for word in set_1 if word in set_2]

This can be slow if both collections are long and stored as lists rather than real sets, because every membership test then scans the whole list.
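
As a minimal sketch of this idea, assuming both word collections are already in memory as plain Python lists: converting them to real sets makes each membership test constant time on average, which avoids the slowdown. The names text_words, racist_words and common_words, and the placeholder data, are made up for illustration and do not come from the question's files.

```python
def common_words(text_words, racist_words):
    """Return the sorted words that occur in both collections (case-insensitive)."""
    # Lowercase both sides so that 'Wort_B' and 'wort_b' count as the same word.
    text_set = {w.lower() for w in text_words}
    racist_set = {w.lower() for w in racist_words}
    # Set intersection: average O(1) lookup per word instead of scanning a list.
    return sorted(text_set & racist_set)


# Hypothetical placeholder data, not taken from the real CSV files.
text_words = ["Das", "ist", "ein", "Beispiel", "mit", "wort_a", "und", "Wort_B"]
racist_words = ["wort_a", "wort_b", "wort_c"]

found = common_words(text_words, racist_words)
print(f"In the text are: {len(found)} racist words. "
      f"The racist words are: {', '.join(found)}")
```

Note that this counts distinct words only; counting how often each word occurs in the text would need something like collections.Counter on the token list instead of a set.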

If your pandas dataframes contain the words, you could intersect them:

df1.set_index('racist_words').index.intersection(df2.set_index('text_words').index)

Before doing so, call .drop_duplicates() on both DataFrames.
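
Put together, the pandas variant might look like the sketch below. It assumes each DataFrame holds one word per row; the column names racist_words and text_words follow the example above, and the data is a hypothetical placeholder rather than the contents of the real CSV files.

```python
import pandas as pd

# Hypothetical placeholder frames with one word per row.
df1 = pd.DataFrame({"racist_words": ["wort_a", "wort_b", "wort_b", "wort_c"]})
df2 = pd.DataFrame({"text_words": ["das", "ist", "wort_a", "und", "wort_b", "wort_a"]})

# Drop duplicate rows first so that every word appears once in each index.
racist_index = df1.drop_duplicates().set_index("racist_words").index
text_index = df2.drop_duplicates().set_index("text_words").index

# Index.intersection keeps only the labels that are present in both indexes.
common = racist_index.intersection(text_index)
common_list = sorted(common)

print(f"In the text are: {len(common_list)} racist words. "
      f"The racist words are: {', '.join(common_list)}")
```

If the words in the two files differ in case, lowercasing both columns (for example with .str.lower()) before intersecting would be a reasonable extra step.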