I have a big PHP object that I want to serialize and store in a MySQL database. The table encoding is UTF-8, and the column that holds the serialized object is also UTF-8.
The problem is the object holds a text string containing French characters.
For example:
Merci d'avoir passé commande avec Lovre. Voici le récapitulatif de votre commande
When I serialize the object and then unserialize it again directly, the string is preserved in the correct format.
However, when I store the serialized object in a MySQL database, retrieve it, and then unserialize it, the string becomes:
Merci d'avoir passÃ© commande avec Lovre. Voici le rÃ©capitulatif de votre commande
Something goes wrong when I store the object in the database.
Notes:
- The object is stored using the Propel ORM.
- The column type is text.
- The string is stored in and read from an HTML file.
The strings created by serialize() are binary strings: they don't have a specific charset encoding but are just an "array" of bytes (where one byte is 8 bits, an octet).
If you now take such a string and tell your database that it is LATIN-1 encoded, and your database stores it in a text field with UTF-8 encoding, the database will transparently change the encoding from LATIN-1 to UTF-8. UTF-8 is a charset encoding that uses more than one byte per character for some characters, for example those you give in your question like é.
The character é is then stored inside the database as the two bytes 0xC3 0xA9 (displayed as Ã© when read as LATIN-1), which is the UTF-8 byte sequence for é.
If you now fetch the data from the database without specifying which encoding you need, the database will return it as UTF-8.
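The effect can be reproduced without a database. The following is a minimal sketch, assuming the original string was LATIN-1 encoded (the same assumption made in this answer) and using iconv() as a stand-in for the conversion the database performs. The sample value is hypothetical.

```php
<?php
// Hypothetical sample value: "\xE9" is the single LATIN-1 byte for "é".
$original = "pass\xE9";

// serialize() records the byte length of the string, here s:5:"...";
$serialized = serialize($original);

// Stand-in for the database re-encoding the data from LATIN-1 to UTF-8.
$stored = iconv('ISO-8859-1', 'UTF-8', $serialized);

// The one-byte "é" has become two bytes, so the string grew by one byte
// while the recorded length inside it is still 5.
echo strlen($serialized), ' -> ', strlen($stored), PHP_EOL;
echo bin2hex("\xE9"), ' -> ', bin2hex(iconv('ISO-8859-1', 'UTF-8', "\xE9")), PHP_EOL;

// unserialize() rejects the modified string (the diagnostic is silenced with @).
$result = @unserialize($stored);
var_dump($result);
```

The recorded length and the actual number of bytes no longer agree, which is why the failure shows up in unserialize() rather than in the database layer.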
Now unserialize() has a problem, because the binary string has been modified in a way that makes it invalid (serialize() records each string's length in bytes, and that length no longer matches).
Instead, you need to either tell your database not to modify the encoding when it stores the serialized string, e.g. by choosing the right column type and encoding (a binary field such as BLOB, Binary Large Object; see the MySQL docs and Binary Types in the Propel docs), or revert the charset encoding back to the original format when you fetch the data from the database. The first approach (binary field) is better because it is exactly what you're looking for.
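For the second option, the conversion can be undone right after fetching. This is only a sketch under the assumption that the data was re-encoded from LATIN-1 to UTF-8. The function name unserializeFromUtf8Column is hypothetical, and the fetched value is simulated here rather than read from MySQL.

```php
<?php
/**
 * Hypothetical helper: undo an assumed LATIN-1 -> UTF-8 re-encoding,
 * then unserialize. Returns false if the value cannot be converted back.
 */
function unserializeFromUtf8Column(string $fetched, string $originalCharset = 'ISO-8859-1')
{
    // iconv() returns false when a character has no equivalent in the target charset.
    $bytes = @iconv('UTF-8', $originalCharset, $fetched);
    if ($bytes === false) {
        return false;
    }
    return unserialize($bytes);
}

// Simulate the round trip: LATIN-1 bytes are serialized, then re-encoded on storage.
$object  = ['subject' => "Voici le r\xE9capitulatif"];
$fetched = iconv('ISO-8859-1', 'UTF-8', serialize($object));

$restored = unserializeFromUtf8Column($fetched);
var_dump($restored === $object);
```

This works only as long as every row went through the same conversion, which is one more reason to prefer a binary column.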
For data that has already been stored in the database in the wrong format, you need to correct the data. To do that, you first need to find out which re-encoding was applied, i.e. from which charset to which charset. I assume it's LATIN-1, but there is no guarantee. You need to review the encoding of your current application data and processes to find out.
After you've found out, encode the values back from UTF-8 to the original encoding.
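One way to narrow down the original charset is to try a few candidates and see which one yields a string that unserialize() accepts again. The helper below is a hypothetical heuristic, not a replacement for reviewing the application: the function name and the candidate list are assumptions, and several charsets can pass for the same data.

```php
<?php
/**
 * Hypothetical helper: return the first candidate charset for which converting
 * the stored value back from UTF-8 gives a string unserialize() accepts.
 * Returns null if no candidate works.
 */
function detectOriginalCharset(string $stored, array $candidates): ?string
{
    foreach ($candidates as $charset) {
        $bytes = @iconv('UTF-8', $charset, $stored);
        if ($bytes === false) {
            continue; // not representable in this charset
        }
        // allowed_classes => false avoids instantiating objects while probing.
        // Note: a serialized boolean false ("b:0;") would be reported as a failure.
        if (@unserialize($bytes, ['allowed_classes' => false]) !== false) {
            return $charset;
        }
    }
    return null;
}

// Simulated damaged row: LATIN-1 bytes, serialized, then re-encoded to UTF-8.
$stored = iconv('ISO-8859-1', 'UTF-8', serialize("r\xE9capitulatif"));

// 'UTF-8' is listed first to check whether the row is damaged at all.
$charset = detectOriginalCharset($stored, ['UTF-8', 'ISO-8859-1']);
var_dump($charset);
```

Once a charset is confirmed, each affected value can be converted back with iconv('UTF-8', $charset, $value) and written to a binary column.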