I have some of these constructions in my code:
```python
from pyparsing import Char, srange

_non_ascii = Char(srange('[\x80-\U0010FFFF]'))
```
The generation of the ranges is extremely slow, taking 6-8 seconds (huh?) even with Python 3.12 on a relatively decent machine.
Why is this happening and what should I replace those with?
Look at the `srange()` function (its source is on GitHub). As you (or I) can see, pyparsing parses the argument using `_reBracketExpr`, which is itself a pyparsing expression, then passes the result of that to `_expanded`. `_expanded` creates a new string if `p` is a `ParseResults` (a range) or returns `p` unchanged otherwise (a single character). In the end, all those strings are joined, creating an even bigger string.
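In effect (this is an illustrative re-creation, not pyparsing's actual code), the expansion step for a single range boils down to:

```python
# Hypothetical stand-in for pyparsing's range expansion: a character
# range such as \x80-\U0010FFFF becomes one chr() call per code point,
# all joined into a single string.
def expand_range(lo: str, hi: str) -> str:
    return "".join(chr(c) for c in range(ord(lo), ord(hi) + 1))

giant = expand_range('\x80', '\U0010FFFF')
print(len(giant))  # 1113984 characters, roughly 1.1M
```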
In your (or my) case, each `Char(srange())` call creates a `0x10FFFF - 0x80 + 1` = ~1.1M character string. Fortunately, since there is only one element, that string is reused rather than joined to create the eventual result. But that's not it; not yet. `Word`, `Char`'s superclass, which gets passed that string, also does work of its own. Setting aside the branches unreachable in this specific case (no `as_keyword` and such), some of the paths that are taken create yet more new strings.
All in all, this is a lengthy and time-consuming process just to create a native `re.Pattern` instance (has anyone counted how many times the original 1.1M character string gets iterated?). So I switched to `Regex`, which makes use of `re` directly: no giant strings are created, and the result is more or less the same as what a plain `Char` was doing.