How to debug invalid data in MySQL utf8mb4 column in Etherpad Lite database

563 views Asked by At

We're running Etherpad Lite and we're trying to migrate database from MySQL to PostgreSQL.

MySQL database 'value' column is of type utf8mb4. However, around 10% of all rows contain value that is in fact encoded in Windows-1252 or ISO-8859-15 instead of UTF-8. How is this possible? Does not MySQL validate the UTF-8 before entering it into the column?

PostgreSQL cannot accept the invalid values during migration because it does validate the data and hits e.g. raw byte 0xE4 (ISO-8859-15: ä) which should be encoded as byte sequence 0xC3 0xA4 in UTF-8.

Is this a known "feature" of MySQL? Is there any way to always get real UTF-8 from utf8mb4 column?

2

There are 2 answers

0
Mikko Rantalainen On BEST ANSWER

No solution is known. This is probably a bug in MySQL which should disallow storing non-UTF-8 data in case client connnection and column type are both utf8mb4.

I no longer use MySQL for anything so I haven't bothered to try to figure this bug any more. Nowadays, I'm using PostgreSQL for everything instead.

5
Rick James On

If

  • you say the client is using latin1 (etc), and
  • you say the column is utf8 (or utf8mb4), and
  • you provide hex E4

Then all is well. The E4 will be converted during the INSERT into C3A4 and that is what is stored. Do SELECT HEX(...) ... to verify.

If

  • you say the client is using utf8 (or utf8mb4), and
  • you say the column is utf8 (or utf8mb4), and
  • you provide hex C3A4

Again, all is well. The C3A4 goes directly into the table.

Here's a messy case:

If

  • you say the client is using latin1, and
  • you say the column is utf8 (or utf8mb4), and
  • but you provide hex C3A4

Then, MySQL is obligated to convert two characters (C3 and A4) into utf8, yielding C383C2A4. I call this "double encoding".

Follow the Best Practice in Trouble with UTF-8 characters; what I see is not what I stored and use its suggested way to test the data. Then come back with more details.

Probably the only way for 10% of the data to be mis-interpreted is for 10% of the data to be encoded differently. So, please provide hex for a 10% example and for a 90% example. And provide the hex in the client before inserting and in the table after it is inserted.