How to eliminate certain elements when scraping?

Question

73 views Asked by Joff At 05 June 2015 at 07:59

I am not sure how to proceed here. Here is an example of a page that I am trying to scrape:

http://www.yonhapnews.co.kr/sports/2015/06/05/1001000000AKR20150605128600007.HTML?template=7722

Right now my XPath selects the div with class 'article' and then its <p> elements. I can always eliminate the first one because it is the same stock news text (city, Yonhap News, reporter, etc.). I am evaluating word densities, so leftover boilerplate like this could be a problem for me :(

The issue comes in towards the end of the article. If you look towards the end there is a reporter email address and a date and time of publishing...

The problem is that different pages of this site have different numbers of <p> tags towards the end, so I cannot just eliminate the last two; that still messes with my results sometimes.

How would you go about eliminating those <p> elements towards the end? Do I just have to try to scrub my data afterwards?

Here is the code snippet that selects the path and eliminates the first <p> and the last two. How should I change it?

# gets all the text from the listed div and then applies the regex to find all word objects in the Hangul range
hangul_syllables = response.xpath('//*[@class="article"]/p//text()').re(ur'[\uac00-\ud7af]+')

# For yonhapnews the first and the last two <p>'s are useless, everything else should be good
hangul_syllables = hangul_syllables[1:-2]

There are 2 answers

LarsH On 05 June 2015 at 10:22

Adding to alecxe's answer, you could exclude the p containing the email address using something that checks for an email address (possibly surrounded by whitespace). How to do that depends on whether you have XPath 2.0 or just 1.0. In 2.0 you could do something like:

//*[@class="article"]/p[not(contains(@class, "adrs")
       or text()[matches(normalize-space(.),
                   "^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$", "i")])]//text()

adapting the regex for email addresses from http://www.regular-expressions.info/email.html. You could change the \.[A-Z]{2,4} to \.kr if you like.
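Scrapy's selectors are built on lxml, which implements XPath 1.0, so matches() is not available there. One way around that is to do the same check in Python after extraction. The following is only a sketch under assumptions: it expects one already-extracted string per <p>, the function name hangul_words and the sample strings are made up for illustration, and the email regex is the one quoted above.

```python
import re

# Same email pattern as in the XPath 2.0 expression above, case-insensitive.
EMAIL = re.compile(r'^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$', re.IGNORECASE)
# Hangul syllables block, as in the question's regex.
HANGUL = re.compile(r'[\uac00-\ud7af]+')


def hangul_words(paragraphs):
    """Return Hangul words from paragraph strings, skipping email-only paragraphs.

    `paragraphs` is assumed to hold one string per <p> element (hypothetical input).
    """
    words = []
    for text in paragraphs:
        # Rough equivalent of XPath normalize-space(): trim and collapse whitespace.
        normalized = " ".join(text.split())
        if EMAIL.match(normalized):
            continue  # paragraph consists only of an email address
        words.extend(HANGUL.findall(normalized))
    return words


if __name__ == "__main__":
    sample = ["한국 축구 대표팀", "  reporter@example.co.kr \n"]
    print(hangul_words(sample))
```

This only drops paragraphs that are nothing but an email address; the publishing-date paragraph still has to be excluded separately, for example by its class as in alecxe's answer.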

alecxe On 05 June 2015 at 08:25 BEST ANSWER

You can tweak your XPath expression not to include the p tag having class="adrs" (the date of publishing):

//*[@class="article"]/p[not(contains(@class, "adrs"))]//text()
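To see what that predicate keeps and drops without running a spider, the same filter can be imitated on a small sample. This is a sketch, not the Scrapy code itself: the standard library's ElementTree does not support not() or contains() in paths, so the class check is done in Python; the sample markup is invented and well-formed, which real pages may not be.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample that mimics the structure described in the question.
SAMPLE = """
<html><body>
<div class="article">
  <p>한국 축구 대표팀이 승리했다.</p>
  <p>reporter@example.com</p>
  <p class="adrs"><span>2015/06/05 16:30 송고</span></p>
</div>
</body></html>
"""

HANGUL = re.compile(r'[\uac00-\ud7af]+')


def article_words(markup):
    """Collect Hangul words from article <p> elements whose class lacks "adrs"."""
    root = ET.fromstring(markup.strip())
    words = []
    for p in root.findall(".//div[@class='article']/p"):
        # Python stand-in for the predicate not(contains(@class, "adrs")).
        if "adrs" in p.get("class", ""):
            continue
        # itertext() plays the role of //text(): all descendant text nodes.
        words.extend(HANGUL.findall("".join(p.itertext())))
    return words


if __name__ == "__main__":
    print(article_words(SAMPLE))
```

In the spider itself, the XPath above would simply replace the one passed to response.xpath() in the question, and the trailing slice would then only need to deal with the first paragraph.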
