Best Way to Debug "html5lib.html5parser.ParseError: Unexpected character after attribute value"?

I am working on a personal project that uses the chessdotcom Public API package. I can already store the PGN (Portable Game Notation) of the daily puzzle in a variable; the PGN is the required input for creating a chess GIF (https://www.chess.com/gifs).

I wanted to use requests and an HTML parser to fill out the form on the GIF page and create a GIF from my Python script. I made a request to the GIF page, and response.text returns a huge HTML string (thousands of lines), which I am parsing with html5lib. I am getting "html5lib.html5parser.ParseError: Unexpected character after attribute value." and I can't figure out where in this giant response the problem is. What are some tips for debugging this? Where do I even begin looking for the unexpected character?

import requests as req
import html5lib
from datetime import datetime
from chessdotcom import Client, get_player_profile, get_player_game_archives, get_player_stats, get_current_daily_puzzle, get_player_games_by_month


Client.request_config['headers']['User-Agent'] = 'PyChess Program for Automated YouTube Creation'


class ChessData:
    def __init__(self, name):
        self.player = get_player_profile(name)
        self.archives = get_player_game_archives(name)
        self.stats = get_player_stats(name)
        self.games = get_player_games_by_month(name, datetime.now().year, datetime.now().month)
        self.puzzle = get_current_daily_puzzle()
        self.html_parser = html5lib.HTMLParser(strict=True, namespaceHTMLElements=True, debug=True)

    def organize_puzzles(self, puzzles):
        #dict_keys(['title', 'url', 'publish_time', 'fen', 'pgn', 'image'])
        portableGameNotation = puzzles['pgn']
        html_data = req.get('https://www.chess.com/gifs')
        print(html_data.text)
        self.html_parser.parse(html_data.text.replace('&', '&amp;'))

    def get_puzzles(self):
        self.organize_puzzles(self.puzzle.json['puzzle'])

I initially had a "Name Entity Expected. Got None" error, which I temporarily bypassed by replacing every & with the &amp; entity.


Traceback (most recent call last):
  File "C:/ChessProgram/ChessTop.py", line 17, in <module>
    main()
  File "C:/ChessProgram/ChessTop.py", line 14, in main
    ChessResults.get_puzzles()
  File "C:\ChessProgram\ChessData.py", line 32, in get_puzzles
    self.organize_puzzles(self.puzzle.json['puzzle'])
  File "C:\ChessProgram\ChessData.py", line 29, in organize_puzzles
    self.html_parser.parse(html_data.text.replace('&', '&amp;'))
  File "C:\ChessProgram\lib\site-packages\html5lib\html5parser.py", line 284, in parse
    self._parse(stream, False, None, *args, **kwargs)
  File "C:\ChessProgram\lib\site-packages\html5lib\html5parser.py", line 133, in _parse
    self.mainLoop()
  File "C:\ChessProgram\lib\site-packages\html5lib\html5parser.py", line 216, in mainLoop
    self.parseError(new_token["data"], new_token.get("datavars", {}))
  File "C:\ChessProgram\lib\site-packages\html5lib\html5parser.py", line 321, in parseError
    raise ParseError(E[errorcode] % datavars)
html5lib.html5parser.ParseError: Unexpected character after attribute value.

I tried replacing & with &amp; to fix the entity name issue, and I manually searched through the HTML response, checking the attributes for anything out of place.

Answer by furas (best answer)

Normally, to debug HTML I would split it into smaller fragments and test each one. With html5lib that may be a problem, because it may need the full HTML to parse it, so you may have to write your own functions in the parser to display more information during parsing.

However, if you use html5lib.HTMLParser() without parameters (or with strict=False), it parses the page without raising an error, even without .replace('&', '&amp;').
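Non-strict mode can also help with locating the problem. As far as I can tell, the parser records every parse error in its errors list together with the stream position, and html5lib.constants.E maps the error codes to messages. Below is a minimal sketch under that assumption; the helper name list_parse_errors and the sample fragment are made up for illustration and are not part of html5lib or of the real page.

```python
import html5lib
from html5lib.constants import E  # maps error codes to human-readable messages


def list_parse_errors(html):
    """Parse leniently and return (line, column, code, message, context) tuples."""
    parser = html5lib.HTMLParser(strict=False)
    parser.parse(html)

    lines = html.splitlines()
    report = []
    for (line, col), code, datavars in parser.errors:
        message = E.get(code, code) % datavars
        # line numbers are 1-based; keep some text around the reported column
        text = lines[line - 1] if 0 < line <= len(lines) else ""
        context = text[max(0, col - 40):col + 40]
        report.append((line, col, code, message, context))
    return report


# Hypothetical fragment: the missing space before class= is an attribute-value error
sample = '<!DOCTYPE html><html><head><title>t</title></head><body><a href="x"class="y">link</a></body></html>'
for line, col, code, message, context in list_parse_errors(sample):
    print(f"line {line}, col {col}: {message}")
    print("    ", context)
```

Passing response.text from the real page instead of the sample should list each reported problem with a line and column to look at, rather than stopping at the first one.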

Still, I wouldn't use html5lib on its own for this, because I don't see any functions for searching elements in the HTML; you may need to write your own.

It is much simpler to do this with BeautifulSoup or lxml (or other modules).
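As a rough illustration of that kind of element search, here is a small sketch with BeautifulSoup. The HTML fragment and the helper name find_hidden_token are invented for the example, and the real page markup may differ; only the input id animated_gif__token comes from the page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the real page (assumed structure)
SAMPLE_HTML = """
<form method="post">
  <input type="text" id="animated_gif_data" name="animated_gif[data]">
  <input type="hidden" id="animated_gif__token" name="animated_gif[_token]" value="abc123">
</form>
"""


def find_hidden_token(html):
    """Return the value of the hidden token input, or None if it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    item = soup.find("input", {"id": "animated_gif__token"})
    if item is None:
        return None
    return item.get("value")


print(find_hidden_token(SAMPLE_HTML))
```

Returning None when the input is missing makes it easier to notice that the page layout changed, instead of failing later with a TypeError.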


Another problem: the page uses cookies and has a hidden input with a token, which it probably compares with the cookies (to generate the image), so this needs requests.Session().

So I do the following:

  • create requests.Session()
  • use the session to get() the page with the form
  • use BeautifulSoup to find the hidden input with the token
  • use the session to post() all the data like a real form
  • use the standard text.find() to find the URL of the animated GIF
    (it has a unique address, so it is easy to find without BeautifulSoup)
  • use the session to get() the animated GIF and write it to a local file
    (this needs .content instead of .text, to work with bytes instead of a string)
  • (optional) use webbrowser to open the URL of the animated GIF in the default browser
    (in case the local image viewer has problems displaying animated GIFs)

Full working code:

#import requests 
from requests import Session
from bs4 import BeautifulSoup 

#headers = {
#    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0' 
#}    

s = Session()
#s.headers.update(headers)

url = 'https://www.chess.com/gifs'

# --- get token ---

response = s.get(url)
html = response.text

#soup = BeautifulSoup(html, 'html.parser')
soup = BeautifulSoup(html, 'html5lib')

item = soup.find('input', {'id': 'animated_gif__token'})
#print(item)

token = item['value']
print('token:', token)

# --- send form, get response and search image ---

game =  "https://www.chess.com/live/game/3048628857"

payload = {
    "animated_gif[data]": game,
    "animated_gif[board_texture]": "green", # "brown",
    "animated_gif[piece_theme]": "neo",
    "animated_gif[_token]": token
}

response = s.post(url, data=payload)
html = response.text

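# the GIF URL has a unique prefix, so a plain string search is enough
prefix = 'https://images.chesscomfiles.com/uploads/game-gifs/'
start = html.find(prefix)
if start == -1:
    raise SystemExit('GIF URL not found in the response')
end = html.find('"', start)

image_url = html[start:end]

print(image_url)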

# --- download file ---

response = s.get(image_url)

# write using `bytes` instead of `text`
with open('animation.gif', 'wb') as f:
    f.write(response.content)

# --- show image_url in browser ---

import webbrowser

webbrowser.open(image_url)