Is HtmlParser.entityref actually a valid regex for matching html entity references?

Question


124 views Asked by n611x007 At 20 November 2014 at 15:33

This is the code from the Python 2.7 HTMLParser module:

entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')

Previously, I assumed it to be more like this:

entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*);')

so the actual pattern caught me by surprise on some unusual data from an unusual source.

My use-case is irrelevant; is there any reason to define the entity reference the way HTMLParser does?

irrelevant use-case: Should anyone wonder, I describe my use-case nevertheless. Please note that I am not trying to solve my use-case anymore. My question is whether HtmlParser's entityref is buggy.

My use-case is similar to this: Strip HTML from strings in Python

The input data I was speaking about is like this:

r'''<foo bar="blah"> asda&Il_'d@m_'<foo rab="halb">'''

The intended output from my use-case would have been r"""a&Il_'d@m_'""".
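To make the difference concrete, here is a minimal sketch that runs both patterns quoted above against the sample input. The variable names (`entityref_lenient`, `entityref_strict`, `data`) are chosen for illustration only and are not part of HTMLParser.

```python
import re

# Pattern quoted above from Python 2.7's HTMLParser: the name may be
# terminated by any single non-alphanumeric character.
entityref_lenient = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')

# Stricter pattern the question expected: the name must end with ';'.
entityref_strict = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*);')

# Sample input from the question.
data = r'''<foo bar="blah"> asda&Il_'d@m_'<foo rab="halb">'''

lenient_match = entityref_lenient.search(data)
strict_match = entityref_strict.search(data)

# The lenient pattern treats "&Il" as an entity reference ended by "_";
# the strict pattern finds nothing because no ';' follows the name.
print(lenient_match.group(0), lenient_match.group(1))
print(strict_match)
```

The lenient pattern picks up `&Il` as a reference named `Il`, which is presumably why this input was handled differently from what the stricter pattern would suggest.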

Edit: I was trying to compare the regex to this SGML reference, and in my understanding the entity reference should end with ; but I'm not that familiar with the topic, so I wanted to ask.


There is 1 answer

**Javier** · Accepted Answer · 2014-11-27T23:14:11+00:00

The syntactic production for reference end reads:

[61] reference end = ( refc | RE )?

where refc is ; and RE is character 13 (CR) in the reference concrete syntax.

That means that the following are recognized as reference ends:

A REFerence Close delimiter (; in the reference syntax), as you expected
A Record End
Nothing (note the use of the ? metacharacter after the close parenthesis, meaning that both REFC and RE are optional)

If nothing is used as a reference end, the reference ends at the first non-name character after the name start character, as required by the rules of the reference recognition mode that has been entered at the Entity Reference Open delimiter (ERO &).
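The three cases can be sketched with a regular expression. This is only an illustration under stated assumptions, not the pattern HTMLParser uses: `sgml_entityref` and `find_refs` are hypothetical names, and the name character class is copied from the regex in the question.

```python
import re

# Hypothetical pattern modelling the three reference ends: after the name,
# optionally consume a refc (';') or an RE (carriage return). If neither is
# present, the reference simply ends at the first non-name character, which
# is left unconsumed.
sgml_entityref = re.compile(r'&([a-zA-Z][-.a-zA-Z0-9]*)([;\r]?)')


def find_refs(text):
    """Return (entity name, reference end) pairs found in text."""
    return [(m.group(1), m.group(2)) for m in sgml_entityref.finditer(text)]


# One reference per kind of reference end: ';', RE, and nothing.
refs = find_refs("a &amp; b &lt\r c &gt d")
for name, end in refs:
    print(repr(name), repr(end))

# The input from the question: '_' is not a name character here,
# so the reference named "Il" ends with an empty reference end.
question_refs = find_refs("asda&Il_'d@m_'")
print(question_refs)
```

Unlike the HTMLParser pattern, this sketch does not consume the non-name character that follows the name, and it still matches when the name is the last thing in the input.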

Note also that ERO is only used for the general entity reference production.
