Constructing large character ranges with pyparsing.srange is extremely slow


I have some of these constructions in my code:

from pyparsing import Char, srange

_non_ascii = Char(srange('[\x80-\U0010FFFF]'))

Generating the ranges is extremely slow, taking 6-8 seconds (huh?) even with Python 3.12 on a relatively decent machine.
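
A minimal way to reproduce the timing (time.perf_counter is just one choice of timer; the 6-8 s figure is from my machine):

import time

from pyparsing import Char, srange

start = time.perf_counter()
_non_ascii = Char(srange('[\x80-\U0010FFFF]'))
print(f"construction took {time.perf_counter() - start:.1f}s")  # ~6-8s here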

Why is this happening and what should I replace those with?

1 Answer

Answered by InSync (accepted):

This is what the srange() function looks like (taken from GitHub):

_charRange = Group(_singleChar + Suppress("-") + _singleChar)
_reBracketExpr = (
    Literal("[")
    + Opt("^").set_results_name("negate")
    + Group(OneOrMore(_charRange | _singleChar)).set_results_name("body")
    + Literal("]")
)

def srange(s: str) -> str:
    _expanded = (
        lambda p: p
        if not isinstance(p, ParseResults)
        else "".join(chr(c) for c in range(ord(p[0]), ord(p[1]) + 1))
    )
    try:
        return "".join(_expanded(part) for part in _reBracketExpr.parse_string(s).body)
    except Exception:
        return ""

As you (or I) can see, pyparsing parses the argument using _reBracketExpr, which is itself a pyparsing expression, then passes each part of the result to _expanded. _expanded creates a new string if p is a ParseResults (a range) and returns p unchanged otherwise (a single character). In the end, all those strings are joined into an even bigger string.
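
To make the expansion concrete, here is a small sketch (the range strings below are my own examples, not pyparsing internals):

from pyparsing import srange

# A small range expands to the literal characters it covers.
print(srange('[a-e]'))  # abcde

# The range from the question expands to one character per code point
# (and is what takes those 6-8 seconds).
expanded = srange('[\x80-\U0010FFFF]')
print(len(expanded))  # 1113984 == 0x10FFFF - 0x80 + 1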

In your (or my) case, each Char(srange()) creates a 0x10FFFF - 0x80 + 1 = 1,113,984 (~1.1M) character string. Fortunately, since there is only one element, that string is reused rather than joined to produce the eventual result. But that isn't the end of it: Word, Char's superclass, which is what the string ultimately gets passed to, also does work of its own:

class Word(Token):

    def __init__(self, init_chars: str = "", ..., *, initChars: str = "", ...):
        initChars = initChars or init_chars
        ...
        # One full iteration
        initChars_set = set(initChars)
        ...
        self.initChars = initChars_set
        # O(n log n) sorting, then another join
        self.initCharsOrig = "".join(sorted(initChars_set))
        
        ...
        
        # __or__() creates a new set, which requires another full iteration.
        if " " not in (self.initChars | self.bodyChars):  
            if len(self.initChars) == 1:
                ...
            else:
                # New string, again.
                # This function is slow too, explained below
                re_leading_fragment = f"[{_collapse_string_to_ranges(self.initChars)}]"

            if self.bodyChars == self.initChars:
                ...
            else:
                ...
                # Yet another new string
                self.reString = f"{re_leading_fragment}{re_body_fragment}{repeat}"
            ...
            self.re = re.compile(self.reString)
            ...
def _collapse_string_to_ranges(
    s: Union[str, Iterable[str]], re_escape: bool = True
) -> str:
    ...

    ret = []

    # Haven't we been here before?
    s = "".join(sorted(set(s)))

    if len(s) > 3:
        # Yet another full iteration
        for _, chars in itertools.groupby(s, key=is_consecutive):
            ...  # Itertools dark magic to append to ret.
            # Mind you, list.append() is amortized O(1).
    else:
        ...

    return "".join(ret)  # A third .join()

I removed the branches not reachable in this specific case (no as_keyword and such), but some of them also create new strings.

All in all, this is a lengthy, time-consuming process just to create a native re.Pattern instance (has anyone counted how many times the original 1.1M-character string gets iterated?). So I switched to Regex, which makes use of re directly:

_non_ascii = Regex(r'[\x80-\U0010FFFF]')  # Both with and without the r prefix work here.

No giant strings are created, and it matches more or less the same thing a normal Char was doing.
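
A quick sanity check that the Regex version accepts the same kind of input (parse_string is standard pyparsing API; the test characters are arbitrary):

from pyparsing import Regex

_non_ascii = Regex(r'[\x80-\U0010FFFF]')

print(_non_ascii.parse_string('é'))           # ['é']
print(_non_ascii.parse_string('\U0001F600'))  # ['😀']

Construction is effectively instantaneous, since re compiles the character class directly instead of expanding it.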