I am trying to parse HTML text from a number of webpages for sentiment analysis. With help from the community I have been able to iterate over many URLs and produce a sentiment score for each one using the textblob library, and I have used the print function successfully to output a score for each url. However, I have not been able to collect the outputs of my function into a list, so that I can store the numbers and use them later to calculate averages and display my results in a graph.
Code with the print function:
import requests
import json
import urllib
from bs4 import BeautifulSoup
from textblob import TextBlob
#you can add to this
urls = ["http://www.thestar.com/business/economy/2015/05/19/canadian-consumer-confidence-dips-but-continues-to-climb-in-us-report.html",
"http://globalnews.ca/news/2012054/canada-ripe-for-an-invasion-of-u-s-dollar-stores-experts-say/",
"http://www.cp24.com/news/tsx-flat-in-advance-of-fed-minutes-loonie-oil-prices-stabilize-1.2381931",
"http://www.marketpulse.com/20150522/us-and-canadian-gdp-to-close-out-week-in-fx/",
"http://www.theglobeandmail.com/report-on-business/canada-pension-plan-fund-sees-best-ever-annual-return/article24546796/",
"http://www.marketpulse.com/20150522/canadas-april-inflation-slowest-in-two-years/"]
def parse_websites(list_of_urls):
    for url in list_of_urls:
        html = urllib.urlopen(url).read()
        soup = BeautifulSoup(html)

        # kill all script and style elements
        for script in soup(["script", "style"]):
            script.extract()    # rip it out

        # get text
        text = soup.get_text()

        # break into lines and remove leading and trailing space on each
        lines = (line.strip() for line in text.splitlines())
        # break multi-headlines into a line each
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        # drop blank lines
        text = '\n'.join(chunk for chunk in chunks if chunk)

        #print(text)
        wiki = TextBlob(text)
        r = wiki.sentiment.polarity
        print r

parse_websites(urls)
output:
>>>
0.10863027172
0.156074203574
0.0766585497835
0.0315555555556
0.0752548359411
0.0902824858757
>>>
But when I try to use a return statement to collect the values into a list, I get no result. Code:
import requests
import json
import urllib
from bs4 import BeautifulSoup
from textblob import TextBlob
#you can add to this
urls = ["http://www.thestar.com/business/economy/2015/05/19/canadian-consumer-confidence-dips-but-continues-to-climb-in-us-report.html",
"http://globalnews.ca/news/2012054/canada-ripe-for-an-invasion-of-u-s-dollar-stores-experts-say/",
"http://www.cp24.com/news/tsx-flat-in-advance-of-fed-minutes-loonie-oil-prices-stabilize-1.2381931",
"http://www.marketpulse.com/20150522/us-and-canadian-gdp-to-close-out-week-in-fx/",
"http://www.theglobeandmail.com/report-on-business/canada-pension-plan-fund-sees-best-ever-annual-return/article24546796/",
"http://www.marketpulse.com/20150522/canadas-april-inflation-slowest-in-two-years/"]
def parse_websites(list_of_urls):
    for url in list_of_urls:
        html = urllib.urlopen(url).read()
        soup = BeautifulSoup(html)

        # kill all script and style elements
        for script in soup(["script", "style"]):
            script.extract()    # rip it out

        # get text
        text = soup.get_text()

        # break into lines and remove leading and trailing space on each
        lines = (line.strip() for line in text.splitlines())
        # break multi-headlines into a line each
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        # drop blank lines
        text = '\n'.join(chunk for chunk in chunks if chunk)

        #print(text)
        wiki = TextBlob(text)
        r = wiki.sentiment.polarity
        r = []
        return [r]

parse_websites(urls)
output:
Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> ================================ RESTART ================================
>>>
>>>
How can I get the numbers into a list like [r1, r2, r3, ...] so I can work with them (add them, subtract them, and so on)?
Thank you in advance.
From your code below, you are asking Python to return an empty list:
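    r = wiki.sentiment.polarity
    r = []          # this rebinds r to an empty list, discarding the score you just computed
    return [r]      # and returning inside the for loop exits after the first url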
If I understood your issue correctly, all you have to do is create the list before the loop, append each score to it, and return it after the loop finishes. A minimal sketch, using the same imports as your question (the `scores` name is just illustrative):
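def parse_websites(list_of_urls):
    scores = []                          # one polarity score per url
    for url in list_of_urls:
        html = urllib.urlopen(url).read()
        soup = BeautifulSoup(html)
        for script in soup(["script", "style"]):
            script.extract()
        text = soup.get_text()
        lines = (line.strip() for line in text.splitlines())
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        text = '\n'.join(chunk for chunk in chunks if chunk)
        scores.append(TextBlob(text).sentiment.polarity)   # keep the score instead of overwriting it
    return scores                        # return once, after every url has been processed

scores = parse_websites(urls)
print scores                             # [r1, r2, r3, ...]
print sum(scores) / len(scores)          # the average you mentioned
The key changes: the list is created once before the loop, each score is appended instead of overwritten, and return sits outside the loop so all of the urls get processed.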
Alternatively, you could create a dictionary with the url as the key. A dictionary may make more sense for you, as it is easier to retrieve, edit, and delete a score when the url it came from is its key. A sketch along the same lines (again, `scores` is just an illustrative name):
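def parse_websites(list_of_urls):
    scores = {}                          # maps each url to its polarity score
    for url in list_of_urls:
        html = urllib.urlopen(url).read()
        soup = BeautifulSoup(html)
        for script in soup(["script", "style"]):
            script.extract()
        text = soup.get_text()
        lines = (line.strip() for line in text.splitlines())
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        text = '\n'.join(chunk for chunk in chunks if chunk)
        scores[url] = TextBlob(text).sentiment.polarity
    return scores

scores = parse_websites(urls)
print scores[urls[0]]                    # look up a single score by its url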
I am kind of new to Python, so hopefully others will correct me if this doesn't make sense...