Is HTMLParser.entityref actually a valid regex for matching HTML entity references?


This is the code from Python 2.7's HTMLParser module:

entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')

Previously, I assumed it to be more like this:

entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*);')

so it caught me by surprise on some strange data from some strange source.

My use case is irrelevant here; is there any reason to define the entity reference the way HTMLParser does?


Irrelevant use case: in case anyone wonders, I describe my use case anyway. Please note that I am no longer trying to solve it. My question is whether HTMLParser's entityref is buggy.

My use-case is similar to this: Strip HTML from strings in Python

The input data I was speaking about is like this:

r'''<foo bar="blah"> asda&Il_'d@m_'<foo rab="halb">'''

The intended output from my use-case would have been r"""a&Il_'d@m_'""".
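To make the surprise concrete, here is a minimal check of both patterns against the sample input. The names HTMLPARSER_ENTITYREF and EXPECTED_ENTITYREF are made up for this sketch; only the standard re module is used.

```python
import re

# Pattern as quoted above from Python 2.7's HTMLParser module.
HTMLPARSER_ENTITYREF = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')
# Stricter pattern that insists on a closing ";".
EXPECTED_ENTITYREF = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*);')

data = r'''<foo bar="blah"> asda&Il_'d@m_'<foo rab="halb">'''

loose = HTMLPARSER_ENTITYREF.search(data)
strict = EXPECTED_ENTITYREF.search(data)

print(loose.group(0))  # &Il_  (the "_" is consumed as the terminating character)
print(loose.group(1))  # Il
print(strict)          # None
```

So the library pattern treats &Il as an entity reference terminated by "_", while the stricter pattern finds nothing in this input.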


Edit: I was trying to compare the regex to this SGML reference, and as I understand it the entity reference should end with ; but I'm not that familiar with the topic, so I wanted to ask.

Answer by Javier (best answer):

The syntactic production for reference end reads:

[61] reference end = ( refc | RE )?

     refc is ";" and RE is character 13 (CR) in the reference concrete syntax

That means that the following are recognized as reference ends:

  • A REFerence Close delimiter (; in the reference syntax), as you expected
  • A Record End
  • Nothing (note the ? metacharacter after the closing parenthesis, which makes the whole group optional, so neither REFC nor RE is required)

If nothing is used as a reference end, the reference ends at the first non-name character after the name start character, as required by the rules of the reference recognition mode that has been entered at the Entity Reference Open delimiter (ERO &).

Note also that ERO is only used for the general entity reference production.
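The three reference ends listed above can be sketched as a regex with an optional terminator. This is only an illustration, not how HTMLParser is implemented: SGML_ENTITY_REF and find_entity_refs are hypothetical names, the name characters are assumed to be the same set used in the question's regex, and recognition modes other than plain content are ignored.

```python
import re

# Name start character, then name characters, then an optional reference end:
# REFC (";") or RE (carriage return, character 13). If neither is present,
# the reference simply ends at the first non-name character.
SGML_ENTITY_REF = re.compile(r'&([a-zA-Z][-.a-zA-Z0-9]*)(;|\r)?')

def find_entity_refs(text):
    """Return (name, reference_end) pairs; reference_end is '' when omitted."""
    return [(m.group(1), m.group(2) or '')
            for m in SGML_ENTITY_REF.finditer(text)]

# One reference per kind of reference end: REFC, RE, and nothing.
print(find_entity_refs("&amp; &lt\r&gt<b>"))
```

Unlike the pattern in the question, the optional group here does not consume the non-name character that follows an unterminated reference.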