I am trying to get my code to scrape http://www.pythonscraping.com/pages/warandpeace.html and then print the 10 most frequent English words. However, the code I have simply finds the most frequent paragraphs/sentences instead of words. So instead of getting the ten most frequent words, I get this junk:

[("Well, Prince, so Genoa and Lucca are now just family estates of the\nBuonapartes. But I warn you, if you don't tell me that this means war,\nif you still try to defend the infamies and horrors perpetrated by\nthat Antichrist- I really believe he is Antichrist- I will have\nnothing more to do with you and you are no longer my friend, no longer\nmy 'faithful slave,' as you call yourself! But how do you do? I see\nI have frightened you- sit down and tell me all the news.", 1),
('If you have nothing better to do, Count [or Prince], and if the\nprospect of spending an evening with a poor invalid is not too\nterrible, I shall be very charmed to see you tonight between 7 and 10-\nAnnette Scherer.',   1),
('Heavens! what a virulent attack!', 1),
("First of all, dear friend, tell me how you are. Set your friend's\nmind at rest,",   1),
('Can one be well while suffering morally? Can one be calm in times\nlike these if one has any feeling?',   1),
('You are\nstaying the whole evening, I hope?', 1), 
("And the fete at the English ambassador's? Today is Wednesday. I\nmust put in an appearance there,",   1),
('My daughter is\ncoming for me to take me there.', 1),
("I thought today's fete had been canceled. I confess all these\nfestivities and fireworks are becoming wearisome.",   1),

My code is:

import nltk

from nltk.corpus import stopwords
from nltk import word_tokenize

stop_words = set(stopwords.words('english'))

from urllib.request import urlopen
from bs4 import BeautifulSoup

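html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
soup = BeautifulSoup(html, "html.parser")

nameList = [tag.text for tag in soup.findAll("span", {"class":"red"})]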

filtered_words = [word for word in nameList if word not in stopwords.words('english')]  

fdist1 = nltk.FreqDist(nameList)

I tried to tokenize nameList by adding "token = nltk.word_tokenize(nameList)", but I end up with TypeError: expected string or bytes-like object.

Does the tokenizer work with web scraping? I have also tried splitting with nameList.split(), but then I end up with AttributeError: 'list' object has no attribute 'split'.

How do I split this chunk of text into individual words?

2 Answers

Community On Best Solutions

nameList is a list of text passages (whole sentences and paragraphs), not a list of words, so it can't be counted word by word as it stands. Your code has the following errors:

  1. You are filtering and counting whole texts, not the words inside them
  2. FreqDist is built from nameList (the texts), not from filtered_words

You should replace your last block of code with this:

# Collect the filtered words from all texts
filtered_words = []
# Go through every text
for text in nameList:
    # split() breaks on any whitespace (spaces and newlines); then drop stopwords
    filtered_words += [word for word in text.split() if word.lower() not in stop_words]

# Count word frequencies over the filtered words
fdist1 = nltk.FreqDist(filtered_words)
# Print the 10 most frequent words with their counts
print(fdist1.most_common(10))
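One caveat with splitting on whitespace is that punctuation and capitalisation stay attached to the words, so "war," and "War" would be counted as different entries. Below is a minimal sketch of the same counting step using only the standard library. The sample texts, the tiny STOP_WORDS set and the helper name top_words are made up for illustration; in the real script you would pass nameList and the NLTK stopword set instead.

```python
import string
from collections import Counter

# Hypothetical stand-in for stopwords.words('english'); deliberately tiny.
STOP_WORDS = {"the", "a", "is", "and", "of", "to", "i", "you"}

def top_words(texts, n=10):
    """Return the n most common non-stopword words across a list of strings."""
    counts = Counter()
    for text in texts:
        for raw in text.split():  # split() handles spaces and newlines alike
            # Strip leading/trailing punctuation and normalise case
            word = raw.strip(string.punctuation).lower()
            if word and word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(n)

sample = ["War is war,\nPrince.", "The war and the Prince!"]
print(top_words(sample, 2))
```

Whether to lowercase and strip punctuation depends on what you want to count; for a simple "most frequent words" list it usually gives more sensible results.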

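Moreover, nltk has a tokenize submodule that can (and should) be used instead of manual splitting, since it handles natural text better (punctuation, for example, is split off into separate tokens). For example, nltk.word_tokenize('Heavens! what a virulent attack!') returns:

['Heavens', '!', 'what', 'a', 'virulent', 'attack', '!']
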
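Note that word_tokenize expects a single string, which is why passing the whole nameList list raises the TypeError from the question: tokenize each text separately and collect the results. The sketch below shows that pattern with a simple regular expression standing in for the NLTK tokenizer. That substitution is an assumption made so the example has no dependencies; it is not how word_tokenize works internally, and simple_tokenize and all_tokens are hypothetical names.

```python
import re

def simple_tokenize(text):
    """Rough stand-in for a word tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def all_tokens(texts):
    """Tokenize each string in a list separately and flatten the results into one list."""
    tokens = []
    for text in texts:
        tokens.extend(simple_tokenize(text))
    return tokens

print(all_tokens(["Heavens! what a virulent attack!", "You are\nstaying"]))
```

With NLTK and its tokenizer data installed, swapping simple_tokenize for nltk.word_tokenize should leave the rest of the loop unchanged.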
Klemen Koleša On

Maybe something like this could help you:

First use re.split() on each element (sentence) in your nameList:

import re
nameList_splitted = [re.split(r';|,|\n| ', x) for x in nameList]
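One thing to watch with this pattern: when two delimiters are adjacent, as in ", " or ",\n", re.split() returns an empty string between them, and those empty strings would end up in the frequency count. A small sketch that drops them (split_words is a hypothetical helper name; the sample text is the start of the first passage in the question's output):

```python
import re

def split_words(sentence):
    """Split on ';', ',', newline or space and drop the empty strings left by adjacent delimiters."""
    return [w for w in re.split(r";|,|\n| ", sentence) if w]

print(re.split(r";|,|\n| ", "Well, Prince"))  # ['Well', '', 'Prince']
print(split_words("Well, Prince"))            # ['Well', 'Prince']
```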

The re.split() call gives you a list of lists of individual words, which you can then combine into one final list like this:

# Flatten the list of lists into a single list of words
list_of_words = []
for list_ in nameList_splitted:
    list_of_words += list_

result is: