How to eliminate certain elements when scraping?


I am not sure how to proceed here. Here is an example of the page that I am trying to scrape:

http://www.yonhapnews.co.kr/sports/2015/06/05/1001000000AKR20150605128600007.HTML?template=7722

Right now my XPath selects the div with class 'article' and then its <p> elements. I can always eliminate the first one because it is the same stock news text (city, yonhapnews, reporter, etc.). I am evaluating word densities, so leftover boilerplate like this could skew my results.

The issue comes towards the end of the article, where there is a reporter email address and the date and time of publishing.

The problem is that different pages of this site have different numbers of <p> tags towards the end, so I cannot just eliminate the last two; on some pages that still messes with my results.

How would you go about eliminating those particular <p> elements towards the end? Do I just have to scrub my data afterwards?

Here is the code snippet that selects the path and eliminates the first <p> and the last two. How should I change it?

# Gets all the text from the <p> elements of the listed div, then applies the regex to find all words in the Hangul syllable range
hangul_syllables = response.xpath('//*[@class="article"]/p//text()').re(ur'[\uac00-\ud7af]+')

# For yonhapnews the first and the last two <p>'s are useless, everything else should be good
hangul_syllables = hangul_syllables[1:-2]

There are 2 answers

alecxe (BEST ANSWER)

You can tweak your XPath expression not to include the p tag having class="adrs" (the date of publishing):

//*[@class="article"]/p[not(contains(@class, "adrs"))]//text()
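A minimal sketch of the same idea, written with only the standard library so it runs outside Scrapy. The sample markup, the `body_words` helper, and the email address are made up for illustration; in a spider you would pass the XPath above straight to `response.xpath(...)`. Note that `.re()` returns a flat list of words, so slicing its result drops words rather than paragraphs; filtering at the paragraph level, as here, avoids that.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the article markup (not the real page).
SAMPLE = """
<html><body><div class="article">
  <p>본문 내용 입니다</p>
  <p>reporter@example.com</p>
  <p class="adrs">2015/06/05 18:00 송고</p>
</div></body></html>
""".strip()

# Same Hangul syllable range as in the question.
HANGUL = re.compile(r"[\uac00-\ud7af]+")


def body_words(markup):
    """Collect Hangul words from article paragraphs, skipping class="adrs"."""
    root = ET.fromstring(markup)
    words = []
    for p in root.findall(".//div[@class='article']/p"):
        # Python-side equivalent of the predicate not(contains(@class, "adrs")).
        if "adrs" in p.get("class", ""):
            continue
        words.extend(HANGUL.findall("".join(p.itertext())))
    return words


words = body_words(SAMPLE)
```

Here the date paragraph is skipped as a whole, so the Hangul word in it never reaches the word list, however many paragraphs follow the article body.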

LarsH

Adding to alecxe's answer, you could exclude the p containing the email address using something that checks for an email address (possibly surrounded by whitespace). How to do that depends on whether you have XPath 2.0 or just 1.0. In 2.0 you could do something like:

//*[@class="article"]/p[not(contains(@class, "adrs")
       or text()[matches(normalize-space(.),
                   "^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$", "i")])]//text()

adapting the regex for email addresses from http://www.regular-expressions.info/email.html. You could change the \.[A-Z]{2,4} to \.kr if you like.
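If you only have XPath 1.0 (as far as I know, Scrapy's selectors are built on lxml, which implements XPath 1.0, though Scrapy does expose EXSLT's `re:test()` as an alternative), the same check can be done in Python after extracting the text of each paragraph. The following is a rough sketch under that assumption; `drop_email_paragraphs` and the sample strings are hypothetical names and data, not part of the original code.

```python
import re

# Same pattern as in the XPath 2.0 expression; the "i" flag becomes re.IGNORECASE.
EMAIL_ONLY = re.compile(r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$", re.IGNORECASE)


def drop_email_paragraphs(paragraphs):
    """Keep paragraphs whose whitespace-normalized text is not just an email address."""
    kept = []
    for text in paragraphs:
        # Rough analogue of XPath's normalize-space(): trim and collapse whitespace.
        normalized = " ".join(text.split())
        if not EMAIL_ONLY.match(normalized):
            kept.append(text)
    return kept


# Hypothetical paragraph texts, one string per <p>.
paragraphs = [
    "본문 내용 입니다",
    "  reporter@example.com \n",
    "문의: reporter@example.com 으로",
]
kept = drop_email_paragraphs(paragraphs)
```

Only the paragraph that consists solely of an email address is dropped; a paragraph that merely mentions an address inside a sentence is kept, which matches the anchored `^...$` pattern in the XPath version.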