I want to compare two CSV files and output the racist words they have in common. I don't want the whole solution; my problem is that I don't understand how to do the comparison. I don't know where to start and hope that you can help me.
Here is my code (I've renamed the file to example; it's a project):
import csv
import re
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
url = 'https://'
ourUrl = urllib.request.urlopen(url)
soup = BeautifulSoup(ourUrl, 'html.parser')
print(soup.prettify())
toggle = []
for i in soup.find_all('article', {'id': 'post-113572'}):
    per_toggle = i.find('div')
    print(per_toggle)
    toggle.append(per_toggle)
New_toggle = []
pattern = re.compile('<.*?>')  # matches any remaining HTML tag
for each in toggle:
    new_each = str(each).replace('<br/>', '')
    result = re.sub(pattern, '', new_each)
    print(result)
    New_toggle.append(result)
df = pd.DataFrame(New_toggle)
df.to_csv('example.csv')
df = pd.read_csv('example.csv')
reader = csv.reader(open('example.csv', 'r', encoding='utf-8'), delimiter=",", quotechar='|')
New_data = []
for line in reader:
    for field in line:
        tokens = word_tokenize(field, language='german')
        posData = pos_tag(tokens)
        print(posData)
        New_data.append(posData)
df1 = pd.DataFrame(New_data)
df1.to_csv('ExTokenization.csv', index=False, encoding="utf-8")
with open('ExTokenization.csv', encoding='utf-8') as infile, \
        open('example.csv', "a", encoding='utf-8') as outfile:
    for line in infile:
        outfile.write("\n" + line.lower())
reader = csv.reader(open('rassismus.csv', 'r', encoding='utf-8'), delimiter=",", quotechar='|')
New_data = []
for line in reader:
    for field in line:
        tokens = word_tokenize(field, language='german')
        posData = pos_tag(tokens)
        print(posData)
        New_data.append(posData)
df2 = pd.DataFrame(New_data)
df2.to_csv('RassismusTokenization.csv', index=False)
This is my whole code. As you can see, I have RassismusTokenization.csv (a CSV with a list of racist words) and ExTokenization.csv (with texts from a website).
I hope that someone can give me a hint. I want to compare the words from the racist list with the text and get a result like: "The text contains X racist words. The racist words are: ....."
Thanks, everyone, for any hints!
So, just as one idea: you might want to find all words that are present in both sets of words (the text, assuming it's a set of words, and your set defining the racist words).
A naive word-by-word comparison could be pretty slow if both sets/texts are long.
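As a rough sketch of that idea, plain Python sets can do the intersection. The function name `common_words` and the sample data are made up for illustration, and the sketch assumes you have the text as one string and the word list as plain strings (not the POS-tagged tuples your CSVs currently hold):

```python
import re


def common_words(text, word_list):
    """Return the sorted words that occur in both the text and the word list."""
    # Lowercase the text and pull out word characters; \w also matches umlauts.
    text_words = set(re.findall(r"\w+", text.lower()))
    # Normalise the list the same way and skip empty entries.
    listed_words = {w.strip().lower() for w in word_list if w.strip()}
    # Set intersection keeps only the words present in both.
    return sorted(text_words & listed_words)


# Hypothetical placeholder data, not taken from the real files
text = "Alpha beta gamma. Beta delta!"
word_list = ["beta", "Delta", "omega"]

found = common_words(text, word_list)
print(f"In the text are: {len(found)} listed words. The words are: {', '.join(found)}")
```

Because set membership checks are hash-based, this should stay fast even when the text is long.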
If your pandas DataFrames contain the words, you could intersect them. Before doing so, you should call .drop_duplicates() on each of them so that every word appears only once.
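One possible way to do the DataFrame intersection is an inner merge. This is only a sketch: the column name `word` and the sample rows are assumptions, and it presumes each frame holds one plain lowercase word per row rather than the tagged tuples written by the code in the question:

```python
import pandas as pd

# Hypothetical one-column frames; "word" is an assumed column name
text_df = pd.DataFrame({"word": ["alpha", "beta", "beta", "gamma", "delta"]})
list_df = pd.DataFrame({"word": ["beta", "delta", "omega", "delta"]})

# Drop duplicates first so each shared word shows up only once in the result
text_unique = text_df.drop_duplicates()
list_unique = list_df.drop_duplicates()

# An inner merge keeps only the rows whose "word" occurs in both frames
common = text_unique.merge(list_unique, on="word", how="inner")
common_list = common["word"].tolist()

print(f"In the text are: {len(common_list)} listed words. The words are: {common_list}")
```

If you skip .drop_duplicates(), a word that is repeated in both frames gets multiplied in the merged result, which would inflate the count.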