I have a very large JSON file. Most of it is valid JSON data, but parts of it are not. The following is a simplification of my case:
[
"this is valid: \ud835\udc47",
"this is invalid: \ud835"
]
The first item is valid and parses successfully, but deserialization fails on the second item. \ud835 is a UTF-16 high surrogate: UTF-8 cannot encode a surrogate code point at all, and UTF-16 does not allow a lone high surrogate because it must be followed by a low surrogate (written as a second \uXXXX escape, such as \udc47 in the first item).
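To make the pairing rule concrete, here is a minimal sketch of how a high and low surrogate combine into a single code point. The function name `combine_surrogates` is made up for illustration; the arithmetic is the standard UTF-16 decoding formula.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"{high:#06x} is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"{low:#06x} is not a low surrogate")
    # Each surrogate carries 10 bits of the code point, offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair from the valid item: \ud835 followed by \udc47.
code_point = combine_surrogates(0xD835, 0xDC47)
print(hex(code_point))  # 0x1d447
```

On its own, 0xD835 only supplies the upper half of that value, which is why the second item cannot be decoded to any character.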
This issue arose from an HTTP server that used Python's built-in JSON deserializer and saved the data to a database. Python's deserializer accepted the lone "\ud835" escape, which is not valid UTF-8 or UTF-16. Now that we want to migrate this application and database to Rust with serde, serde rejects this invalid string.
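For reference, the Python side of this can be reproduced with a short sketch. It assumes the server used the standard-library `json` module with default settings, which I have not confirmed beyond "Python's built-in JSON deserializer".

```python
import json

# Raw strings keep the \uXXXX escapes intact so the JSON parser sees them.
valid = json.loads(r'"this is valid: \ud835\udc47"')
invalid = json.loads(r'"this is invalid: \ud835"')

# The surrogate pair is combined into a single code point.
print(hex(ord(valid[-1])))  # 0x1d447

# The lone surrogate is accepted and kept in the str as-is.
print(hex(ord(invalid[-1])))  # 0xd835

# Encoding that str as UTF-8 fails, because surrogates are not encodable.
try:
    invalid.encode("utf-8")
except UnicodeEncodeError as exc:
    print("cannot encode as UTF-8:", exc.reason)

# With the default ensure_ascii=True, dumps writes the escape back out,
# so the lone surrogate can round-trip into stored JSON unnoticed.
print(json.dumps(invalid))  # "this is invalid: \ud835"
```

So the lone escape survives a load/dump round trip in Python without any error, and only surfaces once a stricter parser reads the stored data.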