This is the code from Python 2.7's HTMLParser module:
entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')
Previously, I assumed it to be more like this:
entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*);')
so it caught me by surprise on some strange data from some strange source.
My use case is beside the point; is there any reason to define the entity reference the way HTMLParser does?
Irrelevant use case: in case anyone wonders, I describe my use case anyway. Please note that I am no longer trying to solve it. My question is whether HTMLParser's entityref is buggy.
My use-case is similar to this: Strip HTML from strings in Python
The input data I was speaking about is like this:
r'''<foo bar="blah"> asda&Il_'d@m_'<foo rab="halb">'''
The intended output from my use-case would have been r"""a&Il_'d@m_'""".
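To make the difference concrete, here is a minimal sketch that runs both patterns against the input above. The names lenient_entityref and strict_entityref are mine, chosen for illustration; only the first pattern comes from HTMLParser.

```python
import re

# Pattern as it appears in Python 2.7's HTMLParser module.
lenient_entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')
# Stricter pattern that requires a terminating semicolon (my assumption).
strict_entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*);')

data = r'''<foo bar="blah"> asda&Il_'d@m_'<foo rab="halb">'''

lenient_match = lenient_entityref.search(data)
strict_match = strict_entityref.search(data)

# The lenient pattern treats "&Il" as an entity reference ended by "_".
print(lenient_match.group(0))  # &Il_
print(lenient_match.group(1))  # Il
# The strict pattern finds nothing, because the input contains no ";".
print(strict_match)            # None
```

So with the HTMLParser pattern, the underscore after "&Il" is enough to end the reference, which is what surprised me.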
Edit: I was trying to compare the regex to this SGML reference, and in my understanding the entity reference should end with ; but I'm not that familiar with the topic, so I wanted to ask.
The syntactic production for reference end reads:
reference end = ( refc | RE )?
That means that the following are recognized as reference ends:
- a Reference Close delimiter (REFC, ; in the reference syntax), as you expected
- a Record End (RE)
- nothing (note the ? metacharacter after the close parenthesis, meaning that both REFC and RE are optional)
If nothing is used as a reference end, the reference ends at the first non-name character after the name start character, as required by the rules of the reference recognition mode that has been entered at the Entity Reference Open delimiter (ERO, & in the reference syntax).
Note also that ERO is only used for the general entity reference production.
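As a rough illustration of the optional reference end, the sketch below uses a regex that accepts a name followed by ;, by a record end, or by nothing at all. The pattern and the helper find_references are hypothetical, not part of HTMLParser, and modelling RE as a newline character is an assumption made only for this example.

```python
import re

# Hypothetical pattern: "&", a name, then an optional reference end.
# The reference end is ";" (REFC) or "\n" (standing in for RE, an assumption).
sgml_entityref = re.compile(r'&([a-zA-Z][-.a-zA-Z0-9]*)(;|\n)?')

def find_references(text):
    """Return (name, reference_end) pairs; reference_end is '' when omitted."""
    return [(m.group(1), m.group(2) or '') for m in sgml_entityref.finditer(text)]

# REFC, no reference end, and a newline as reference end, in that order.
print(find_references("a &amp; b &lt c &gt\nd"))
# [('amp', ';'), ('lt', ''), ('gt', '\n')]

# The name from the question's data ends at the first non-name character "_".
print(find_references("asda&Il_'d@m_'"))
# [('Il', '')]
```

Under this reading, the name simply stops at the first character that cannot be part of a name, which matches the behaviour seen with the HTMLParser pattern.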