Counting full, case-insensitive words using Scrapy and regular expressions

115 views Asked by jedmund At 18 May 2022 at 15:10

I am using a Scrapy crawl spider to count how many times specific words appear on each page in a domain. My code mostly works, but I would like the match to be case insensitive and to count only full words. For example, when counting 'demo', I would like it to also count 'Demo' and 'DEMO', but not 'democracy'. Here is what I have so far:

    def parse_item(self, response):
        yield {
            'demo': len(response.css('body').re('demo')),
        }

For the case sensitivity issue, I have found advice that suggests using XPath's translate() or re.IGNORECASE. For the full-words-only issue, I have found advice on using word boundaries (\b). However, I am not sure how to incorporate either of them in this situation, and my attempts so far have failed.

Edit

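The following fix solves the problem. The inline flag (?i) makes the pattern case insensitive, and the \b on each side restricts matches to whole words: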

    def parse_item(self, response):
        yield {
            'demo': len(response.css('body').re(r'(?i)\bdemo\b')),
        }
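
As a quick sanity check outside Scrapy, the same pattern can be run against a plain string with Python's re module. This is only a sketch: the sample sentence is made up for illustration and is not taken from a crawled page.

```python
import re

# Same pattern as in the fix above: (?i) turns on case-insensitive
# matching, and \b on each side anchors the match to word boundaries.
PATTERN = re.compile(r'(?i)\bdemo\b')

# Hypothetical sample text (an assumption, not real page content).
sample = 'Demo of a DEMO: the demo covers democracy and demos.'

matches = PATTERN.findall(sample)
print(matches)       # ['Demo', 'DEMO', 'demo']
print(len(matches))  # 3
```

'democracy' and 'demos' are skipped because the character after 'demo' is a word character, so there is no boundary there. Keep in mind that \b treats punctuation such as a hyphen as a boundary, so 'demo-site' would still count. Also, since .re() is applied to the serialized body element, occurrences inside tag attributes or inline scripts are counted too, so totals may be higher than the visible text alone would give.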

