I followed this tutorial (parse/scrape with python requests-html) successfully. However, when I was about to adjust the code to add a column containing the URL, I realized that the class I was about to use (.question-hyperlink) was already being used to parse the question itself.
How would you add a url column to this code?
result:
https://i.stack.imgur.com/xZ4hD.jpg
attempt:
def parse_tagged_page(html):
    question_summaries = html.find(".question-summary")
    key_names = ['question', 'votes', 'tags', 'summary', 'url']
    classes_needed = ['.question-hyperlink', '.vote', '.tags', '.summary', '.question-hyperlink']
    datas = []
    for q_el in question_summaries:
        question_data = {}
        for i, _class in enumerate(classes_needed):
            sub_el = q_el.find(_class, first=True)
            keyname = key_names[i]
            question_data[keyname] = clean_scraped_data(sub_el.text, keyname=keyname)
        datas.append(question_data)
    return datas
The URL is contained in the href attribute of the <a> element, so passing sub_el.text to the function clean_scraped_data() will not help: the text of that element is the question title, not the link. You should probably refactor that function so that it can handle the link, and adjust the function call accordingly.
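A minimal sketch of one way to do this, not necessarily the refactoring the answer had in mind. It assumes sub_el behaves like a requests-html Element (it exposes .text and an .attrs dict), that the scraped site is Stack Overflow (BASE_URL is an assumption), and that the tutorial's clean_scraped_data accepts a keyname keyword as in the question. The helper name extract_field is hypothetical.

```python
from urllib.parse import urljoin

# Assumption: the tutorial scrapes Stack Overflow, whose question links are relative.
BASE_URL = "https://stackoverflow.com"


def extract_field(sub_el, keyname, clean=None, base_url=BASE_URL):
    """Return the value for one column of a question row.

    sub_el is expected to expose .text and an .attrs dict,
    as a requests-html Element does. extract_field is a
    hypothetical helper, not part of the tutorial.
    """
    if sub_el is None:
        # The selector matched nothing inside this question summary.
        return None
    if keyname == "url":
        # The link lives in the href attribute, not in the element text.
        href = sub_el.attrs.get("href")
        return urljoin(base_url, href) if href else None
    text = sub_el.text
    # Delegate to the tutorial's cleaning function when one is supplied.
    return clean(text, keyname=keyname) if clean else text
```

Inside the inner loop of parse_tagged_page, the assignment would then become question_data[keyname] = extract_field(sub_el, keyname, clean=clean_scraped_data). Branching on keyname keeps the two parallel lists from the question unchanged, so the same .question-hyperlink selector can serve both the question and url columns.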