I'm using Scrapy's xpath + re to extract data from web pages. The text is Unicode (Russian), and every string I need to extract contains a long dash ('\u2014' in Python). The problem is that my regex never matches the full string; it splits it at the long dash instead. It's really inconvenient for me. Here are some examples I've already tried that didn't work:
response.xpath('some xpath goes here').re(r'[\w\s\\u2014\.,]+')
response.xpath('some xpath goes here').re(r'[\w\s\\u2014\.,]+')
response.xpath('some xpath goes here').re(r'[\w\s\\x2014\.,]+')
response.xpath('some xpath goes here').re(r'[\w\s\\uFFFF\.,]+')
response.xpath('some xpath goes here').re(r'[\w\s\.,—]+')
response.xpath('some xpath goes here').re(r'[\w\s\u(\w){4}\.,]+')
response.xpath('some xpath goes here').re(r'[\w\s(\u(\d)){6}\.,]+')
Versions: Python 2.7, Scrapy 0.24.6
Turn your patterns into unicode strings and do not escape the backslash (\) in front of u2014. Also, I guess you might want to use the re.UNICODE flag so that \w and \s will match all Unicode word and whitespace characters. According to the Scrapy documentation, selector.re doesn't support flags, but it can consume a compiled regular expression, so you can do this:
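As a minimal sketch of the idea, using only the re module so it runs without a crawl: the sample string below is made up for illustration, and the name PATTERN is just a placeholder. The pattern is a unicode literal in which \u2014 keeps a single backslash, so Python turns it into the dash itself, while \w and \s keep doubled backslashes so they reach the regex engine unchanged.

```python
# -*- coding: utf-8 -*-
import re

# Unicode pattern compiled with re.UNICODE so \w and \s cover Cyrillic text.
# \u2014 is resolved by Python into the long dash before the regex sees it.
PATTERN = re.compile(u'[\\w\\s\u2014.,]+', re.UNICODE)

# Hypothetical text standing in for whatever the XPath would select.
sample = u'Москва \u2014 столица России, город.'

# The whole string comes back as one match instead of being split at the dash.
matches = PATTERN.findall(sample)

# With Scrapy, the compiled pattern would be passed in place of the string:
# response.xpath('some xpath goes here').re(PATTERN)
```

Inside a character class the dot and comma need no escaping, so the sketch drops the backslashes in front of them; keeping them would not change what is matched.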