Parsing a URL for a tag, from a list of URLs scraped from a saved HTML file, and saving it all in a CSV output


How can I make a smooth transition from Part 1 to Part 2, and then save the results in Part 3? So far, I have not been able to parse a scraped URL unless I inserted it into Part 2 myself. Besides, I could not save the output results, because the last URL overwrote all the other ones.

import urllib
import mechanize
from bs4 import BeautifulSoup
import os, os.path
import urlparse
import re
import csv

Part 1:

path = '/Users/.../Desktop/parsing/1.html'

f = open(path,"r")
if f.mode == 'r':       
    contents = f.read()

soup = BeautifulSoup(contents)
search = soup.findAll('div',attrs={'class':'mf_oH mf_nobr mf_pRel'})
searchtext = str(search)
soup1 = BeautifulSoup(searchtext)   

for tag in soup1.findAll('a', href = True):
    raw_url = tag['href'][:-7]
    url = urlparse.urlparse(raw_url)
    p = "http"+str(url.path)

Part 2:

for i in url:
    url = "A SCRAPED URL LINK FROM ABOVE"

    homepage = urllib.urlopen(url)
    soup = BeautifulSoup(homepage)

    for tag in soup.findAll('a',attrs={'name':'g_my.main.right.gifts.link-send'}):
        searchtext = str(tag['href'])
        original = searchtext
        removed = original.replace("gifts?send=", "")
        print removed

Part 3:

i = 0
for i in removed:
    f = open("1.csv", "a+")
    f.write(removed)
    i += 1
    f.close

Update 1. After the advice, I still get this:

Traceback (most recent call last):
  File "page.py", line 31, in <module>
    homepage = urllib.urlopen(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 180, in open
    fullurl = unwrap(toBytes(fullurl))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1057, in unwrap
    url = url.strip()
AttributeError: 'ParseResult' object has no attribute 'strip'
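For reference, this traceback says exactly what went wrong: urlparse.urlparse() returns a ParseResult object, while urllib.urlopen() expects a plain string; calling geturl() on the result converts it back. A minimal sketch (the example.com URL is just an illustration):

import urlparse

parts = urlparse.urlparse("http://example.com/path")
print type(parts)     # <class 'urlparse.ParseResult'>, not a string
print parts.geturl()  # "http://example.com/path" - a plain string urlopen() accepts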

1 Answer

Answered by Tim Pietzcker:

In part 1, you keep overwriting url with a new URL. You should use a list instead and append each URL to it:

urls = []
for tag in soup1.findAll('a', href = True):
    raw_url = tag['href'][:-7]
    url = urlparse.urlparse(raw_url)
    urls.append(url)
    p = "http"+str(url.path) # don't know what that's for, you're not using it later

Then, in part 2, you can iterate over urls directly. Again, removed shouldn't be overwritten on each iteration. Also, there's no need for the variable original: a replace() call won't change searchtext, since it returns a new string and leaves the original alone:

removed_list = []
for url in urls:
    homepage = urllib.urlopen(url.geturl())  # urlopen() needs a string, not a ParseResult
    soup = BeautifulSoup(homepage)

    for tag in soup.findAll('a',attrs={'name':'g_my.main.right.gifts.link-send'}):
        searchtext = str(tag['href'])
        removed = searchtext.replace("gifts?send=", "")
        print removed
        removed_list.append(removed)
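
As a quick illustration of why original was redundant: Python strings are immutable, so replace() always returns a new string and never modifies the one it is called on:

s = "gifts?send=12345"  # a made-up href value
t = s.replace("gifts?send=", "")
print s                 # still "gifts?send=12345" - unchanged
print t                 # "12345"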

Then, in part 3, you don't have to open and close the file for each line you write. In fact, you weren't closing it at all: f.close without parentheses doesn't call the close() method. The proper way is to use a with statement anyway:

with open("1.csv", "w") as outfile:
    for item in removed_list:
        outfile.write(item + "\n")

Although I don't see how this is a CSV file (only one item per line?)...
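If real CSV output with more than one column is the goal, say the source URL next to each extracted value, the csv module the question already imports would be the idiomatic tool. A sketch, assuming part 2 collected (url, removed) pairs instead of bare values (the column names and example rows here are made up):

import csv

# hypothetical (source URL, extracted value) pairs gathered in part 2
rows = [("http://example.com/page1", "12345"),
        ("http://example.com/page2", "67890")]

with open("1.csv", "wb") as outfile:        # "wb" is what the Python 2 csv module expects
    writer = csv.writer(outfile)
    writer.writerow(["url", "extracted"])   # header row
    writer.writerows(rows)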