I am working on a personal project that uses the chessdotcom Public API package. I can already store the PGN (Portable Game Notation) of the daily puzzle in a variable; the PGN is the required input for creating a chess GIF (https://www.chess.com/gifs).
I want to use requests and an HTML parser to fill out the form on the GIF site and create a GIF from my Python script. I made a request to the GIF page, and response.text returns a huge HTML string (thousands of lines), which I am parsing with html5lib. I am getting "html5lib.html5parser.ParseError: Unexpected character after attribute value." and I can't figure out where in this giant response the problem is. What are some tips or tricks for debugging this? Where do I even begin looking for the unexpected character?
    import requests as req
    import html5lib
    from datetime import datetime
    from chessdotcom import Client, get_player_profile, get_player_game_archives, get_player_stats, get_current_daily_puzzle, get_player_games_by_month

    Client.request_config['headers']['User-Agent'] = 'PyChess Program for Automated YouTube Creation'

    class ChessData:
        def __init__(self, name):
            self.player = get_player_profile(name)
            self.archives = get_player_game_archives(name)
            self.stats = get_player_stats(name)
            self.games = get_player_games_by_month(name, datetime.now().year, datetime.now().month)
            self.puzzle = get_current_daily_puzzle()
            self.html_parser = html5lib.HTMLParser(strict=True, namespaceHTMLElements=True, debug=True)

        def organize_puzzles(self, puzzles):
            # dict_keys(['title', 'url', 'publish_time', 'fen', 'pgn', 'image'])
            portableGameNotation = puzzles['pgn']
            html_data = req.get('https://www.chess.com/gifs')
            print(html_data.text)
            self.html_parser.parse(html_data.text.replace('&', '&amp;'))

        def get_puzzles(self):
            self.organize_puzzles(self.puzzle.json['puzzle'])
I initially ran into a "Name Entity Expected. Got None" error, which I temporarily bypassed by replacing every & with the &amp; entity.
    Traceback (most recent call last):
      File "C:/ChessProgram/ChessTop.py", line 17, in <module>
        main()
      File "C:/ChessProgram/ChessTop.py", line 14, in main
        ChessResults.get_puzzles()
      File "C:\ChessProgram\ChessData.py", line 32, in get_puzzles
        self.organize_puzzles(self.puzzle.json['puzzle'])
      File "C:\ChessProgram\ChessData.py", line 29, in organize_puzzles
        self.html_parser.parse(html_data.text.replace('&', '&amp;'))
      File "C:\ChessProgram\lib\site-packages\html5lib\html5parser.py", line 284, in parse
        self._parse(stream, False, None, *args, **kwargs)
      File "C:\ChessProgram\lib\site-packages\html5lib\html5parser.py", line 133, in _parse
        self.mainLoop()
      File "C:\ChessProgram\lib\site-packages\html5lib\html5parser.py", line 216, in mainLoop
        self.parseError(new_token["data"], new_token.get("datavars", {}))
      File "C:\ChessProgram\lib\site-packages\html5lib\html5parser.py", line 321, in parseError
        raise ParseError(E[errorcode] % datavars)
    html5lib.html5parser.ParseError: Unexpected character after attribute value.
I tried replacing & with &amp; to fix the entity name issue, and I manually searched through the HTML response, checking the different attributes for anything out of place.
Normally, to debug HTML I would split it into smaller pieces and test each one. With html5lib that may be a problem, because it may need the full HTML to parse it, so you may have to write your own functions in the parser to display more information during parsing.
But if you use html5lib.HTMLParser() without parameters (or with strict=False), then it runs correctly, even without .replace('&', '&amp;').
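A possible way to see where the parser complains, as a rough sketch rather than a definitive recipe: in non-strict mode html5lib does not raise, it records each problem in parser.errors as a ((line, column), error_code, datavars) tuple. The helper names collect_parse_errors, show_context and report are made up for this example, and the assumption that the offending character sits at or just before the reported column is mine.

```python
def collect_parse_errors(html_text):
    """Parse leniently and return the errors html5lib recorded.

    Each entry is ((line, column), error_code, datavars).
    """
    # Third-party package; imported here so show_context() works without it.
    import html5lib

    parser = html5lib.HTMLParser(strict=False)
    parser.parse(html_text)
    return parser.errors


def show_context(html_text, line, column, width=40):
    """Return the text around a 1-based line and 0-based column."""
    # Split on the same newline forms html5lib normalizes to "\n".
    lines = html_text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    if not 1 <= line <= len(lines):
        return ""
    source_line = lines[line - 1]
    start = max(0, column - width)
    return source_line[start:column + width]


def report(html_text):
    """Print every recorded error with the surrounding source text."""
    for (line, column), code, _datavars in collect_parse_errors(html_text):
        context = show_context(html_text, line, column)
        print(f"{line}:{column} {code}: {context!r}")
```

Calling report(html_data.text) should then list each error code with a slice of the source near it, so you can look for "unexpected-character-after-attribute-value" instead of reading thousands of lines by hand.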
Still, I wouldn't use html5lib for this, because I don't see any functions for searching elements in the HTML; you may have to write your own. It is much simpler to do it with BeautifulSoup or lxml (or other modules).
Another problem: the page uses cookies and has a hidden input with a token, which it probably compares with the cookies (to generate the image), so this needs requests.Session().
So I do the following:
- use requests.Session()
- get() the page with the form
- use BeautifulSoup to search for the hidden input with the token
- post() all the data like the real form
- use text.find() to find the URL of the animated gif (it has a unique address, so it is easy to find without BeautifulSoup)
- get() the animated gif and write it to a local file (it needs .content instead of .text to work with bytes instead of a string)
- use webbrowser to display the URL of the animated gif in the default browser (if the local image viewer has problems displaying animated gifs)
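Three of the steps above can be sketched as small helpers. This is only an illustration: the function names are made up, and the real name of the hidden input and the real prefix of the gif URL are not given here, so they are passed in as parameters or collected generically instead of being hard-coded.

```python
def hidden_form_fields(html_text):
    """Collect name/value pairs of every hidden <input> on the page."""
    # Third-party package (beautifulsoup4), imported lazily.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html_text, "html.parser")
    return {
        tag["name"]: tag.get("value", "")
        for tag in soup.find_all("input", attrs={"type": "hidden"})
        if tag.get("name")
    }


def find_url(text, prefix, terminator='"'):
    """Return the first substring starting with prefix, up to terminator."""
    start = text.find(prefix)
    if start == -1:
        return None  # prefix not present in the response
    end = text.find(terminator, start)
    return text[start:] if end == -1 else text[start:end]


def save_bytes(path, data):
    """Write raw bytes (for example response.content) to a local file."""
    with open(path, "wb") as handle:
        handle.write(data)
```

The dictionary from hidden_form_fields() can be merged with the PGN field and sent with the session's post(), and save_bytes() must receive response.content, not response.text, because the gif is binary data.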
Full working code: