Python scraping and outputting to Excel

I am trying to create a web crawler. I am currently just testing it on YouTube, but I intend to expand it to do more later. For now, I am still learning.

Currently I am trying to export the information to a CSV file. The code below is what I have at the moment, and it seemed to be working great when I was running it to pull title descriptions. However, when I added code to get the "views" and "likes", the output file got messed up because those values have commas in them.

Does anyone know what I can do to get around this?

import urllib2
import __builtin__
from selenium import webdriver
from selenium.common.exceptions import NoSuchAttributeException
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import time
from time import sleep
from random import randint
from lxml import etree

browser = webdriver.Firefox()
time.sleep(2)
browser.get("https://www.youtube.com/results?search_query=funny")
time.sleep(2)
browser.find_element_by_xpath("//*[@id='section-list']/li/ol/li[1]/div/div/div[2]/h3/a").click()
time.sleep(2)
url = browser.current_url
title = browser.find_element_by_xpath("//*[@id='eow-title']").text
views = browser.find_element_by_xpath("//*[@id='watch7-views-info']/div[1]").text
likes = browser.find_element_by_xpath("//*[@id='watch-like']/span").text
dislikes = browser.find_element_by_xpath("//*[@id='watch-dislike']/span").text
tf = 'textfile.csv'
f2 = open(tf, 'a+')
f2.write(', '.join([data.encode('utf-8') for data in [url]]) + ',')
f2.write(', '.join([data.encode('utf-8') for data in [title]]) + ',')
f2.write(', '.join([data.encode('utf-8') for data in [views]]) + ',')
f2.write(', '.join([data.encode('utf-8') for data in [likes]]) + ',')
f2.write(', '.join([data.encode('utf-8') for data in [dislikes]]) + '\n')
f2.close()

There are 2 answers

Oliver W. On

First, whether you see those numbers with commas rather than points depends on the language and regional settings that YouTube detects for your browser.

Once you have your views, likes and dislikes as strings, you could perform an operation like the following to get rid of the commas:

likes = "3,141,592"
likes = likes.replace(',', '')  # likes is now: "3141592"
likes = int(likes)  # likes is now an actual integer, not just a string

This works because all three values are integers, so there is no decimal separator to worry about: every comma (or point) in them is only a thousands separator and can be dropped safely.
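As a slightly more defensive variant of the replace-and-convert step above, the sketch below strips everything that is not a digit before converting. The helper name parse_count is hypothetical, and it assumes the scraped text holds a single non-negative whole number, possibly with thousands separators or a trailing label such as "views".

```python
import re


def parse_count(text):
    """Return the whole number in `text`, ignoring separators and labels.

    Hypothetical helper: assumes `text` contains exactly one non-negative
    integer, e.g. "3,141,592" or "1.234 views".
    """
    digits = re.sub(r"\D", "", text)  # drop commas, points, spaces, letters
    if not digits:
        raise ValueError("no digits found in %r" % (text,))
    return int(digits)


likes = parse_count("3,141,592")
views = parse_count("1.234 views")
```

Because it removes points as well as commas, this would give wrong results for values with a real fractional part, so it only suits counts like views, likes and dislikes.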

Finally, good examples of how to use the csv module are everywhere on the internet. I would suggest the one from Python Module of the Week. The module quotes fields that contain commas, so they no longer break your columns. Once you understand the examples, you'll be able to change your code to use this highly efficient module.
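To illustrate how the csv module handles values with commas, here is a minimal sketch written for Python 3 (the code in this thread is Python 2). The sample values are made up. The writer wraps any field containing a comma in double quotes, so reading the text back gives the same five columns.

```python
import csv
import io

# Made-up sample values; the title, view and like fields contain commas.
row = ["https://www.youtube.com/watch?v=example", "A funny, short clip",
       "3,141,592", "2,718", "28"]

# Write the row into an in-memory text buffer.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
csv_text = buffer.getvalue()

# Reading the text back yields the original five fields unchanged.
parsed = next(csv.reader(io.StringIO(csv_text)))
```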

Binux On

You don't need to write the raw CSV format yourself. Use the csv module: https://docs.python.org/2/library/csv.html.

Sample code:

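import csv
import StringIO

# Build the CSV text in an in-memory buffer (Python 2)
stringio = StringIO.StringIO()
csv_writer = csv.writer(stringio)
# Each writerow() call emits a separate line in the output
csv_writer.writerow([data.encode('utf-8') for data in [url]])
csv_writer.writerow([data.encode('utf-8') for data in [title]])
csv_writer.writerow([data.encode('utf-8') for data in [views]])
csv_writer.writerow([data.encode('utf-8') for data in [likes]])
csv_writer.writerow([data.encode('utf-8') for data in [dislikes]])
# Open for appending; the default mode 'r' is read-only and write() would fail
with open('textfile.csv', 'ab') as fp:
  fp.write(stringio.getvalue())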

I don't understand the purpose of [data.encode('utf-8') for data in [url]]. Did you mean this instead?

csv_writer.writerow([data.encode('utf-8') for data in [url, title, views, likes, dislikes]])

You can also try csv.writer(open('textfile.csv', 'a+')) to write to the file directly, without going through a string buffer.
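In Python 3, the direct-to-file variant might look like the sketch below. The file is opened in text mode with newline='' as the csv documentation recommends, and the encoding is set on open(), so no manual encode('utf-8') is needed. The function name append_row is hypothetical.

```python
import csv


def append_row(path, row):
    """Append one record to the CSV file at `path` (hypothetical helper)."""
    # newline="" lets the csv module control line endings itself;
    # mode "a" creates the file if it does not exist yet.
    with open(path, "a", newline="", encoding="utf-8") as fp:
        csv.writer(fp).writerow(row)
```

With the variables from the question, a call such as append_row('textfile.csv', [url, title, views, likes, dislikes]) would add one line per video.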