Writing scraped links to a CSV file using Python3


I have scraped a website for HTML links and have a result of about 500 links. When I try to write them to a CSV file, I get only the base page instead of the full list.

Here is my code:

import requests
from bs4 import BeautifulSoup
import csv

page = requests.get('https://www.census.gov/programs-surveys/popest.html')
print(page.status_code)
soup = BeautifulSoup(page.text, 'html.parser')
link_set = set()
for link in soup.find_all('a'):
    web_links = link.get("href")
    print(web_links)

csvfile = open('code_python.csv', 'w+', newline='')
writer = csv.writer(csvfile)
writer.writerow(['Links'])
writer.writerow([web_links])
csvfile.close()

I only get two lines in my csv file: the header 'Links' and www.census.gov. I have tried adding another for loop in the csv writer section, but I get similar results.

for link in soup.find_all('a'):
    web_links = link.get('href')
    abs_url = join(page, web_links)
    print(abs_url)
    if abs_url and abs_url not in link_set:
        writer.write(str(abs_url) + "\n")
        link_set.add(abs_url)

It seems the 'web_links' assignment is where I should be collecting all the links for the csv file, but no dice. Where am I making my mistake?


There are 2 answers

Akash KC (best answer)

In your code, you are writing only two rows to the csv, i.e.

 writer.writerow(['Links'])
 writer.writerow([web_links]) 

Here web_links holds only the last href value retrieved by the loop.
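The effect is easy to reproduce without any scraping: a variable assigned inside a for loop refers only to the final iteration's value once the loop ends (the href list below is made up for illustration):

```python
# Hypothetical hrefs standing in for the scraped values.
hrefs = ["/a.html", "/b.html", "https://www.census.gov"]

for href in hrefs:
    web_links = href  # rebound on every iteration

# After the loop, only the last href survives.
print(web_links)  # https://www.census.gov
```

So a single `writerow([web_links])` placed after the loop can only ever write that final value.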

I don't see a need for the set instance. You can print and write to the csv without it, as follows:

page = requests.get('https://www.census.gov/programs-surveys/popest.html')
print(page.status_code)
soup = BeautifulSoup(page.text, 'html.parser')
csvfile = open('code_python.csv', 'w+', newline='')
writer = csv.writer(csvfile)
writer.writerow(['Links'])
for link in soup.find_all('a'):
    web_links = link.get("href")
    if web_links:
        print(web_links)
        writer.writerow([web_links])
csvfile.close()
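A small variation on the same idea (not the answerer's exact code) uses a `with` block so the file is closed automatically even if an exception occurs; the link list here is a hypothetical stand-in for the scraped hrefs:

```python
import csv

# Hypothetical sample data standing in for link.get("href") results.
web_links_list = ["/programs-surveys/popest.html", "https://www.census.gov", None]

with open('code_python.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Links'])           # header row
    for web_links in web_links_list:
        if web_links:                    # skip anchors with no href
            writer.writerow([web_links])
# csvfile is closed here, no explicit close() needed
```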
wp78de

You have never added the scraped links to your set():

import requests
from bs4 import BeautifulSoup
import csv

page = requests.get('https://www.census.gov/programs-surveys/popest.html')
print(page.status_code)
soup = BeautifulSoup(page.text, 'html.parser')
link_set = set()
for link in soup.find_all('a'):
    web_links = link.get("href")
    print(web_links)
    link_set.add(web_links)

csvfile = open('code_python.csv', 'w+', newline='')
writer = csv.writer(csvfile)
writer.writerow(['Links'])
for link in link_set:
    writer.writerow([link])
csvfile.close()
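The question's `abs_url = join(page, web_links)` line also hints at a second goal: converting relative hrefs to absolute URLs. There is no such `join` for URLs, but the standard library provides `urllib.parse.urljoin`. A minimal sketch, with a made-up href list standing in for `find_all('a')` results:

```python
from urllib.parse import urljoin

base = 'https://www.census.gov/programs-surveys/popest.html'
# Hypothetical hrefs as link.get('href') might return them,
# including a missing href (None) and a duplicate.
hrefs = ['popest/data.html', '/newsroom.html', None,
         'https://example.com/page', '/newsroom.html']

link_set = set()
for href in hrefs:
    if href:  # skip anchors without an href attribute
        link_set.add(urljoin(base, href))  # resolve relative to base

print(sorted(link_set))
```

The set deduplicates the repeated href, and `urljoin` leaves already-absolute URLs untouched, so the rows written to the csv would all be full URLs.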