I am working on a project where I scrape a number of blogs and save a selection of the data to a SQLite
database, such as the title of the post, the date it was posted, and the content of the post.
The goal in the end is to do some fancy textual analyses, but right now I have a problem with writing the data to the database.
I work with the Pattern library for Python (the module about databases can be found here).
I am busy with the third blog now. The data from the two other blogs is already saved in the database, and for the third blog, which is similarly structured, I adapted the code.
There are several functions that are well integrated with each other, and they work fine. I also get access to all the data the right way: when I try it out in IPython Notebook, it works fine. When I ran the code as a trial in the console for only one blog page (there are 43 altogether), it also worked and saved everything nicely in the database. But when I ran it again for all 43 pages, it threw a date error.
There are some comments and print statements inside the functions now, which I used for debugging. The problem seems to happen in the function parse_post_info, which passes a dictionary on to the function that goes over all blog pages and opens every single post. That function then saves the dictionary that parse_post_info returns if it is not None, but I think it is empty because something about the date format goes wrong.
Also, why does the code work once, while the same code throws a DateError the second time?
DateError: unknown date format for '2015-06-09T07:01:55+00:00'
Here is the function:
from pattern.db import Database, field, pk, date, STRING, INTEGER, BOOLEAN, DATE, NOW, TEXT, TableError, PRIMARY, eq, all
from pattern.web import URL, Element, DOM, plaintext

def parse_post_info(p):
    """ This function receives a post Element from the post list and
    returns a dictionary with post url, post title, labels, date.
    """
    try:
        post_header = p("header.entry-header")[0]
        title_tag = post_header("a < h1")[0]
        post_title = plaintext(title_tag.content)
        print post_title
        post_url = title_tag("a")[0].href
        date_tag = post_header("div.entry-meta")[0]
        post_date = plaintext(date_tag("time")[0].datetime).split("T")[0]
        #post_date = date(post_date_text)
        print post_date
        post_id = int(((p).id).split("-")[-1])
        post_content = get_post_content(post_url)
        labels = " "
        print labels
        return dict(blog_no=blog_no,
                    post_title=post_title,
                    post_url=post_url,
                    post_date=post_date,
                    post_id=post_id,
                    labels=labels,
                    post_content=post_content
                    )
    except:
        pass
The date() function returns a new Date, a convenient subclass of Python's datetime.datetime. It takes an integer (Unix timestamp), a string or NOW.
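The timestamp in the error carries a UTC offset (+00:00), so it can differ from your local time.
Also, the default format is "YYYY-MM-DD hh:mm:ss", which the string '2015-06-09T07:01:55+00:00' does not match.
The format codes for converting dates can be found here.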
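As a rough illustration of the conversion, here is a minimal sketch that uses only Python's standard datetime module to rewrite the timestamp from the error message into the "YYYY-MM-DD hh:mm:ss" form. The helper name normalize_timestamp is hypothetical and not part of Pattern, and the sketch assumes the first 19 characters of the string always hold the date and time. It simply drops the UTC offset rather than applying it, which is only harmless for offsets like +00:00.

```python
from datetime import datetime

# Format named in the answer above: "YYYY-MM-DD hh:mm:ss".
TARGET_FORMAT = "%Y-%m-%d %H:%M:%S"


def normalize_timestamp(raw):
    """Hypothetical helper (not part of Pattern): rewrite a timestamp such as
    '2015-06-09T07:01:55+00:00' as '2015-06-09 07:01:55'.
    Assumption: the UTC offset is dropped, not applied."""
    # The first 19 characters hold the date and time without the offset.
    parsed = datetime.strptime(raw[:19], "%Y-%m-%dT%H:%M:%S")
    return parsed.strftime(TARGET_FORMAT)


print(normalize_timestamp("2015-06-09T07:01:55+00:00"))
```

The resulting string could then be handed to date() inside parse_post_info, on the assumption that date() accepts strings in this format as described above.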