scrapy regex cannot find long dash

Question

384 views Asked by thepolina At 05 June 2015 at 11:51

I'm using Scrapy's XPath selectors with .re() to extract data from web pages. The text is Unicode (Russian), and every string I need to extract contains a long dash (u'\u2014' in Python). The problem is that my regex never matches the full string: the match is split at the long dash, which is really inconvenient. Here are some patterns I've already tried, none of which worked:

response.xpath('some xpath goes here').re(r'[\w\s\\u2014\.,]+')
response.xpath('some xpath goes here').re(r'[\w\s\\u2014\.,]+')
response.xpath('some xpath goes here').re(r'[\w\s\\x2014\.,]+')
response.xpath('some xpath goes here').re(r'[\w\s\\uFFFF\.,]+')
response.xpath('some xpath goes here').re(r'[\w\s\.,—]+')
response.xpath('some xpath goes here').re(r'[\w\s\u(\w){4}\.,]+')
response.xpath('some xpath goes here').re(r'[\w\s(\u(\d)){6}\.,]+')

Versions: Python 2.7, Scrapy 0.24.6

There is 1 answer

**Konstantin** · Answer 1 · 2015-06-05T12:35:05+00:00

Turn your patterns into unicode strings and do not escape the backslash in \u2014:

response.xpath('some xpath goes here').re(ur'[\w\s\u2014\.,]+')

Also, you probably want the re.UNICODE flag so that \w and \s match all Unicode word and whitespace characters. According to the Scrapy documentation, selector.re doesn't support flags, but it does accept a compiled regular expression, so you can do this:

import re
response.xpath('some xpath goes here').re(re.compile(ur'[\w\s\u2014\.,]+', re.UNICODE))
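To see why the escaped pattern splits the match, here is a minimal sketch that uses only the standard re module, without Scrapy. The sample sentence is a made-up stand-in for whatever the XPath selects. The fixed pattern is written as u'...' with doubled backslashes rather than ur'...' so that the snippet also runs on Python 3, where the ur prefix is a syntax error; on Python 2.7 it is equivalent to the ur'...' form above.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical sample text (not from the question) containing a long dash.
text = u"\u041c\u043e\u0441\u043a\u0432\u0430 \u2014 \u0441\u0442\u043e\u043b\u0438\u0446\u0430, 2015."

# Pattern from the question: in a raw string, "\\u2014" is a literal backslash
# followed by the characters u, 2, 0, 1, 4. The dash itself is not in the class.
broken = re.compile(r'[\w\s\\u2014\.,]+', re.UNICODE)

# Unicode pattern: the string literal turns \u2014 into the dash character
# before the regex engine sees it. \\w and \\s reach the engine as \w and \s.
fixed = re.compile(u'[\\w\\s\u2014.,]+', re.UNICODE)

broken_matches = broken.findall(text)  # two pieces, split at the dash
fixed_matches = fixed.findall(text)    # one piece covering the whole string

print(len(broken_matches), len(fixed_matches))
```

The compiled fixed pattern can be passed to response.xpath(...).re(...) in the same way as in the answer's last snippet.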
