I am using
mysql Ver 15.1 Distrib 10.3.39-MariaDB, for Linux (x86_64) using readline 5.1
PHP 8.2.13 (cli) (built: Nov 24 2023 09:33:30) (NTS)
I would like to use either or both of these to find data within mysql where the mysql data columns have been declared to be in character set utf8mb4, collation utf8mb4_unicode_ci, but where the characters in the data are not actually valid UTF-8 characters.
I'm pretty sure that I have Mojobake, latin1, or other non-valid characters in my data. My data comes from governmental databases that do not really screen what reporters to the database enter and do not have a consistent character encoding. Some of the characters certainly look like garbage characters.
However, I've tried various ways of automatically detecting these, and none of them seem to work.
I've tried within mysql (I'm pretty sure that all of the defaults e.g. SET NAMES are set correctly):
select field_name from table_name
WHERE CONVERT(field_name, binary) RLIKE '([\\xC0-\\xC1]|[\\xF5-\\xFF]|\\xE0[\\x80-\\x9F]|\\xF0[\\x80-\\x8F]|[\\xC2-\\xDF](?![\\x80-\\xBF])|[\\xE0-\\xEF](?![\\x80-\\xBF]{2})|[\\xF0-\\xF4](?![\\x80-\\xBF]{3})|(?<=[\\x00-\\x7F\\xF5-\\xFF])[\\x80-\\xBF]|(?<![\\xC2-\\xDF]|[\\xE0-\\xEF]|[\\xE0-\\xEF][\\x80-\\xBF]|[\\xF0-\\xF4]|[\\xF0-\\xF4][\\x80-\\xBF]|[\\xF0-\\xF4][\\x80-\\xBF]{2})[\\x80-\\xBF]|(?<=[\\xE0-\\xEF])[\\x80-\\xBF](?![\\x80-\\xBF])|(?<=[\\xF0-\\xF4])[\\x80-\\xBF](?![\\x80-\\xBF]{2})|(?<=[\\xF0-\\xF4][\\x80-\\xBF])[\\x80-\\xBF](?![\\x80-\\xBF]))'
I've tried within PHP retrieving the mysql data into $string (I'm pretty sure that I've set all the PHP defaults to use UTF-8) and then:
1.
if (preg_match("//u", $string)) {
// $string is valid UTF-8
}
if (mb_detect_encoding($string,"UTF-8",TRUE)) {
// $string is valid UTF-8
}
But none of these methods detect the characters that look wrong.
Sample data
DROP TABLE IF EXISTS sample;
CREATE TABLE sample ( data text DEFAULT NULL )
ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
INSERT INTO sample VALUES (
'Mr. Geoffrey H. Yost, O�Melveny & Myers LLP')
,('Valero Refining Company � California')
,('El Paso Merchant Energy � Petroleum Company')
,('Papé Group/IFCO')
,('Huntons� SureCrop Farm Svc Inc dba SureCrop Farm')
,('developed, manufactured, marketed, tested, and sold the electronic diesel control that allowed Mercedes to manipulate emissions controls and that Bosch marketed ¢€š¬Ã…¡¬ÃƒÆ’‚¬¦¡€š¬Ã…¡¬Ãƒ€ ‚¬„¢‚¬Å¡¬¦¢€š¬Ã…¡¬ÃƒÆ’‚¬¦€š¬Ã…€œClean Diesel¢€š¬Ã…¡¬ÃƒÆ’‚¬¦¡€š¬Ã…¡¬€š¬Ã…¡<9d> to the public.');
The original data was stored in utf8 or utf8mb4 but incorrectly decoded using an 8-bit encoding, like latin1.
'�'was unicode replacement character�which usually indicates that some other character could not correctly be converted before.You can easily check this with e.g. python
Currently the utf8mb4 character set contains 1,112,064 valid characters, but an 8-bit character set only contains 255 characters. This means that your data theoretically might have 1.111.809 wrong multibyte characters, which weren't converted correctly.