Writing scraped links to a CSV file using Python3


I have scraped a website for HTML links and have a result of about 500 links. When I try to write them to a CSV file, I get only the base page instead of the full list.

Here is my code:

import requests
from bs4 import BeautifulSoup
import csv

page = requests.get('https://www.census.gov/programs-surveys/popest.html')
print(page.status_code)
soup = BeautifulSoup(page.text, 'html.parser')
link_set = set()
for link in soup.find_all('a'):
    web_links = link.get("href")
    print(web_links)

csvfile = open('code_python.csv', 'w+', newline='')
writer = csv.writer(csvfile)
writer.writerow(['Links'])
writer.writerow([web_links])
csvfile.close()

I only get two lines in my csv file: the header 'Links' and www.census.gov. I have tried adding another for loop in the csv writer section, but I get similar results.

for link in soup.find_all('a'):
    web_links = link.get('href')
    abs_url = join(page, web_links)
    print(abs_url)
    if abs_url and abs_url not in link_set:
        writer.write(str(abs_url) + "\n")
        link_set.add(abs_url)

It seems the 'web_links' assignment is where I should be collecting all the links for the csv file, but no dice. Where am I making my mistake?


There are 2 answers

Akash KC (best answer)

In your code, you are writing only two rows to the csv, i.e.

 writer.writerow(['Links'])
 writer.writerow([web_links]) 

Here web_links holds only the last href value retrieved by the loop.
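The effect is easy to reproduce without any scraping: a variable assigned inside a for loop refers only to the final iteration's value once the loop ends (the href list below is made up for illustration):

```python
# Hypothetical hrefs standing in for the scraped values.
hrefs = ["/a.html", "/b.html", "https://www.census.gov"]

for href in hrefs:
    web_links = href  # rebound on every iteration

# After the loop, only the last href survives.
print(web_links)  # https://www.census.gov
```

So a single `writerow([web_links])` placed after the loop can only ever write that final value.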

I don't see a need for the set instance. You can print and write to the csv without it, as follows:

page = requests.get('https://www.census.gov/programs-surveys/popest.html')
print(page.status_code)
soup = BeautifulSoup(page.text, 'html.parser')
csvfile = open('code_python.csv', 'w+', newline='')
writer = csv.writer(csvfile)
writer.writerow(['Links'])
for link in soup.find_all('a'):
    web_links = link.get("href")
    if web_links:
        print(web_links)
        writer.writerow([web_links])
csvfile.close()
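A small variation on the same idea (not the answerer's exact code) uses a `with` block so the file is closed automatically even if an exception occurs; the link list here is a hypothetical stand-in for the scraped hrefs:

```python
import csv

# Hypothetical sample data standing in for link.get("href") results.
web_links_list = ["/programs-surveys/popest.html", "https://www.census.gov", None]

with open('code_python.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Links'])           # header row
    for web_links in web_links_list:
        if web_links:                    # skip anchors with no href
            writer.writerow([web_links])
# csvfile is closed here, no explicit close() needed
```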
wp78de

You have never added the scraped links to your set():

import requests
from bs4 import BeautifulSoup
import csv

page = requests.get('https://www.census.gov/programs-surveys/popest.html')
print(page.status_code)
soup = BeautifulSoup(page.text, 'html.parser')
link_set = set()
for link in soup.find_all('a'):
    web_links = link.get("href")
    print(web_links)
    link_set.add(web_links)

csvfile = open('code_python.csv', 'w+', newline='')
writer = csv.writer(csvfile)
writer.writerow(['Links'])
for link in link_set:
    writer.writerow([link])
csvfile.close()
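The question's `abs_url = join(page, web_links)` line also hints at a second goal: converting relative hrefs to absolute URLs. There is no such `join` for URLs, but the standard library provides `urllib.parse.urljoin`. A minimal sketch, with a made-up href list standing in for `find_all('a')` results:

```python
from urllib.parse import urljoin

base = 'https://www.census.gov/programs-surveys/popest.html'
# Hypothetical hrefs as link.get('href') might return them,
# including a missing href (None) and a duplicate.
hrefs = ['popest/data.html', '/newsroom.html', None,
         'https://example.com/page', '/newsroom.html']

link_set = set()
for href in hrefs:
    if href:  # skip anchors without an href attribute
        link_set.add(urljoin(base, href))  # resolve relative to base

print(sorted(link_set))
```

The set deduplicates the repeated href, and `urljoin` leaves already-absolute URLs untouched, so the rows written to the csv would all be full URLs.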