I am not sure how to proceed here. This is an example of the page that I am trying to scrape:
http://www.yonhapnews.co.kr/sports/2015/06/05/1001000000AKR20150605128600007.HTML?template=7722
Right now my XPath selects the div with class 'article' and then its <p> elements. I can always eliminate the first one because it is the same stock news text (city, yonhapnews, reporter, etc.). I am evaluating word densities, so this could be a problem for me :(
The issue comes in towards the end of the article. If you look towards the end there is a reporter email address and a date and time of publishing...
The problem is that on different pages of this site, there are different numbers of <p> tags towards the end, so I cannot just eliminate the last two because it still messes with my results sometimes.
How would you go about eliminating those particular <p> elements towards the end? Do I just have to try and scrub my data afterwards?
Here is the code snippet that selects the path and eliminates the first <p> and the last two. How should I change it?
# gets all the text from the listed div and then applies the regex to find all word objects in the Hangul range
hangul_syllables = response.xpath('//*[@class="article"]/p//text()').re(ur'[\uac00-\ud7af]+')
# For yonhapnews the first and the last two <p>'s are useless, everything else should be good
hangul_syllables = hangul_syllables[1:-2]
You can tweak your XPath expression so that it does not include the p tag with class="adrs" (the date of publishing):
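A minimal sketch of what that filter could look like. The XPath string is my guess at a suitable expression for Scrapy's response.xpath(), using the standard XPath 1.0 not() function; it has not been run against the live page. To keep the example runnable without Scrapy, the same filter is also applied with the standard library's ElementTree, whose XPath subset has no not(), so the class check is done in Python there. The sample markup and the names ARTICLE_TEXT_XPATH, SAMPLE and hangul_words are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Assumed expression for Scrapy: text of every <p> under the "article"
# element, except a <p> whose class attribute is exactly "adrs".
ARTICLE_TEXT_XPATH = '//*[@class="article"]/p[not(@class="adrs")]//text()'

# Same Hangul syllable range as in the question.
HANGUL_RE = re.compile(u'[\uac00-\ud7af]+')

# Hypothetical, simplified stand-in for the article markup (not the real page).
SAMPLE = u"""<html><body><div class="article">
<p>서울 연합뉴스 기자</p>
<p>본문 첫째 문단</p>
<p>본문 <b>둘째</b> 문단</p>
<p>reporter@example.com</p>
<p class="adrs">송고 시간</p>
</div></body></html>"""


def hangul_words(markup):
    """Collect Hangul words from the article paragraphs, skipping p.adrs."""
    root = ET.fromstring(markup)
    words = []
    for p in root.findall(".//div[@class='article']/p"):
        # ElementTree cannot express not(...), so filter on the attribute here.
        if p.get("class") == "adrs":
            continue
        # itertext() also picks up text inside nested tags such as <b>.
        words.extend(HANGUL_RE.findall(u"".join(p.itertext())))
    return words


words = hangul_words(SAMPLE)
```

In the spider, the equivalent would presumably be response.xpath(ARTICLE_TEXT_XPATH).re(u'[\uac00-\ud7af]+'). Note that the reporter email paragraph contributes nothing either way, since plain ASCII text does not match the Hangul range.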