How do I find non-UTF-8 characters in MySQL (MariaDB) data using MySQL or PHP?

I am using

mysql Ver 15.1 Distrib 10.3.39-MariaDB, for Linux (x86_64) using readline 5.1

PHP 8.2.13 (cli) (built: Nov 24 2023 09:33:30) (NTS)

I would like to use either or both of these to find data within mysql where the mysql data columns have been declared to be in character set utf8mb4, collation utf8mb4_unicode_ci, but where the characters in the data are not actually valid UTF-8 characters.

I'm pretty sure that I have mojibake, latin1, or other invalid characters in my data. My data comes from governmental databases that do not really screen what reporters enter and that have no consistent character encoding. Some of the characters certainly look like garbage.

However, I've tried various ways of automatically detecting these, and none of them seem to work.

I've tried this within MySQL (I'm pretty sure that all of the defaults, e.g. SET NAMES, are set correctly):

select field_name from table_name
WHERE CONVERT(field_name, binary) RLIKE '([\\xC0-\\xC1]|[\\xF5-\\xFF]|\\xE0[\\x80-\\x9F]|\\xF0[\\x80-\\x8F]|[\\xC2-\\xDF](?![\\x80-\\xBF])|[\\xE0-\\xEF](?![\\x80-\\xBF]{2})|[\\xF0-\\xF4](?![\\x80-\\xBF]{3})|(?<=[\\x00-\\x7F\\xF5-\\xFF])[\\x80-\\xBF]|(?<![\\xC2-\\xDF]|[\\xE0-\\xEF]|[\\xE0-\\xEF][\\x80-\\xBF]|[\\xF0-\\xF4]|[\\xF0-\\xF4][\\x80-\\xBF]|[\\xF0-\\xF4][\\x80-\\xBF]{2})[\\x80-\\xBF]|(?<=[\\xE0-\\xEF])[\\x80-\\xBF](?![\\x80-\\xBF])|(?<=[\\xF0-\\xF4])[\\x80-\\xBF](?![\\x80-\\xBF]{2})|(?<=[\\xF0-\\xF4][\\x80-\\xBF])[\\x80-\\xBF](?![\\x80-\\xBF]))'

I've tried retrieving the MySQL data into $string within PHP (I'm pretty sure that I've set all the PHP defaults to use UTF-8) and then:

1.

if (preg_match("//u", $string)) {
  // $string is valid UTF-8
}

2.

if (mb_detect_encoding($string, "UTF-8", TRUE)) {
  // $string is valid UTF-8
}

But none of these methods detect the characters that look wrong.
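
For example, run against one of the garbled values from the sample data below, every one of these checks reports valid UTF-8. A minimal standalone repro (mb_check_encoding added for good measure; assumes the mbstring extension):

<?php
// One of the garbled values from the sample data below.
$string = "Pap\u{00C3}\u{00A9} Group/IFCO"; // "Papé Group/IFCO"

var_dump(preg_match('//u', $string));                 // int(1)
var_dump(mb_detect_encoding($string, 'UTF-8', true)); // string(5) "UTF-8"
var_dump(mb_check_encoding($string, 'UTF-8'));        // bool(true)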

Sample data

 DROP TABLE IF EXISTS sample; 
 CREATE TABLE sample ( data text DEFAULT NULL ) 
 ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci; 
 INSERT INTO sample VALUES (
 'Mr. Geoffrey H. Yost, O�Melveny & Myers LLP')
 ,('Valero Refining Company � California')
 ,('El Paso Merchant Energy � Petroleum Company')
 ,('Papé Group/IFCO')
 ,('Huntons� SureCrop Farm Svc Inc dba SureCrop Farm')
 ,('developed, manufactured, marketed, tested, and sold the electronic diesel control that allowed Mercedes to manipulate emissions controls and that Bosch marketed ¢€š¬Ã…¡¬ÃƒÆ’‚¬¦¡€š¬Ã…¡¬Ãƒ€ ‚¬„¢‚¬Å¡¬¦¢€š¬Ã…¡¬ÃƒÆ’‚¬¦€š¬Ã…€œClean Diesel¢€š¬Ã…¡¬ÃƒÆ’‚¬¦¡€š¬Ã…¡¬€š¬Ã…¡<9d> to the public.'); 
There are 4 answers

Answer from Georg Richter

The original data was stored in utf8 or utf8mb4 but incorrectly decoded using an 8-bit encoding, like latin1.

'�' is the Unicode replacement character (U+FFFD), which usually indicates that some other character could not be converted correctly at an earlier step.

You can easily check this with, e.g., Python:

>>> 'é'.encode('latin1').decode('utf8')
'é'
>>> '�'.encode('latin1').decode('utf8')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'latin-1' codec can't encode character '\ufffd' in position 0: ordinal not in range(256)

Currently the utf8mb4 character set contains 1,112,064 valid characters, while an 8-bit character set contains only 256. This means that your data could theoretically contain any of 1,111,808 multibyte characters that were not converted correctly.
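
The same round trip can be written in PHP with mbstring, if that is more convenient. A minimal sketch, assuming a UTF-8 source file and terminal (note that MySQL's latin1 is actually closer to Windows-1252, so substitute 'Windows-1252' if your bytes fall in the 0x80-0x9F range):

<?php
// Take the UTF-8 bytes of "é" (C3 A9), reinterpret them as latin1,
// and re-encode to UTF-8: that is exactly how "é" becomes "é".
echo mb_convert_encoding('é', 'UTF-8', 'ISO-8859-1'), "\n"; // é

// The reverse direction recovers the intended character.
echo mb_convert_encoding('é', 'ISO-8859-1', 'UTF-8'), "\n"; // é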

Answer from Luuk

I loaded your sample data, and then, based on this answer, I ran:

-- Because I do not like selecting text with weird characters:
alter table sample add id INT AUTO_INCREMENT PRIMARY KEY;

select 
   id, 
   data, 
   convert(cast(convert(data using  latin1) as binary) using utf8mb4) L 
from sample 
where id=4;

This produces:

id | data             | L
 4 | Papé Group/IFCO | Papé Group/IFCO

It seems that your import from the external databases incorrectly treated latin1 data as UTF-8.

Your statement, "... governmental databases that do not really screen what reporters to the database enter ...", might be incorrect; it might be that you simply imported the data under wrong assumptions.

A solution, to find those errors, might be:

select 
  data d1,
  convert(cast(convert(data using  latin1) as binary) using utf8mb4) d2 
from sample 
where data<>convert(cast(convert(data using  latin1) as binary) using utf8mb4);

But this leads to the error:

ERROR 1267 (HY000): Illegal mix of collations (utf8mb4_unicode_ci,IMPLICIT) and (utf8mb4_0900_ai_ci,IMPLICIT) for operation '<>'

So, instead of correcting the error, you might want to start over and redo the import in a correct way?

Also try this:

select id, data, convert(cast(convert(data using  latin1) as binary) using utf8mb4) as d2 
from sample 
where data collate utf8mb4_0900_ai_ci != convert(cast(convert(data using  latin1) as binary) using utf8mb4) and id between 5 and 5;

select id, data, convert(cast(convert(data using  latin1) as binary) using utf8mb4) as d2 
from sample 
where data collate utf8mb4_0900_ai_ci != convert(cast(convert(data using  latin1) as binary) using utf8mb4) and id >= 5;

The first statement does not throw a warning, but the second statement does. The warning is: Invalid utf8mb4 character string: 'A2809A'

The difference between those last two statements, and the warning given, suggests that reliably finding these wrong characters may be all but impossible.

I created a DBFIDDLE with the statements used.
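
To sidestep the collation clash entirely, the same double-conversion comparison could also be done client-side. A rough PHP sketch (the PDO connection details are placeholders; this uses the id column added above):

<?php
// Placeholder connection details -- adjust to your environment.
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', 'user', 'pass');

foreach ($pdo->query('SELECT id, data FROM sample') as $row) {
    // PHP equivalent of CONVERT(CAST(CONVERT(data USING latin1) AS BINARY) USING utf8mb4):
    // squeeze the text back down to latin1 bytes, then reinterpret those bytes as UTF-8.
    $reencoded = mb_convert_encoding($row['data'], 'ISO-8859-1', 'UTF-8');

    // Report rows where the round trip changed something AND the result is
    // still valid UTF-8. Characters outside latin1 come back as '?', so
    // treat hits as candidates to inspect, not as automatic fixes.
    if ($reencoded !== $row['data'] && mb_check_encoding($reencoded, 'UTF-8')) {
        printf("id %d may be double-encoded: %s -> %s\n",
               $row['id'], $row['data'], $reencoded);
    }
}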

Answer from Rich P

Thanks to Georg Richter, Chris Haas, and other contributors for clarifying this. I'm going to try answering my own question.

It's possible to have an invalid UTF-8 sequence -- from the Unicode FAQ: "A sequence such as <110xxxxx₂ 0xxxxxxx₂> is illegal, and must never be generated. When faced with this illegal byte sequence while transforming or interpreting, a UTF-8 conformant process must treat the first byte 110xxxxx₂ as an illegal termination error: for example, either signaling an error, filtering the byte out, or representing the byte with a marker such as FFFD (REPLACEMENT CHARACTER)."
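
A malformed sequence like that really is caught by the methods from my question. For example, in PHP (assuming mbstring):

<?php
// 0xC3 opens a two-byte UTF-8 sequence, so a following 0x28 ('(')
// makes the sequence illegal.
$bad = "\xC3\x28";

var_dump(preg_match('//u', $bad));          // bool(false) -- PCRE rejects malformed UTF-8
var_dump(mb_check_encoding($bad, 'UTF-8')); // bool(false)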

However, the incoming data that I get has already been "converted" into UTF-8 by the kind of mis-decoding described in the other answers. Therefore, all of the characters are valid UTF-8. The three methods that I presented in my question can detect invalid UTF-8, but they don't work on these data because everything is already valid UTF-8.

To find improperly converted characters, I can use MySQL to find rows where length(field_name) <> char_length(field_name) (this detects any multibyte character), and then test those multibyte characters for the replacement character or for anything on an ad hoc list of sequences that I never expect to see in well-formed data. For instance, "ÃÆ" is valid UTF-8 (two characters, in fact), but I never expect to see it in real data. I can keep adding such sequences to a table somewhere until my detector works well enough.
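
A bare-bones PHP sketch of that detector (the connection details are placeholders, the blacklist is only a starting point, and the id column is the one Luuk added):

<?php
// Placeholder connection details -- adjust to your environment.
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', 'user', 'pass');

// Ad hoc blacklist of sequences I never expect in well-formed data;
// extend it as new garbage turns up.
$suspects = ["\u{FFFD}", 'ÃÆ', 'Ã…', '€š'];

// LENGTH() counts bytes and CHAR_LENGTH() counts characters, so this
// restricts the scan to rows containing at least one multibyte character.
$sql = 'SELECT id, data FROM sample WHERE LENGTH(data) <> CHAR_LENGTH(data)';

foreach ($pdo->query($sql) as $row) {
    foreach ($suspects as $s) {
        if (str_contains($row['data'], $s)) {
            printf("id %d contains suspect sequence '%s': %s\n",
                   $row['id'], $s, $row['data']);
            break;
        }
    }
}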

Answer from Rick James
  • � is hex EFBFBD, which is the REPLACEMENT CHARACTER (U+FFFD), not the byte-order mark (hex EFBBBF), which I discuss here

    A decoder emits that character when it meets bytes it cannot interpret, so wherever it appears (for example, in the apostrophe position in 'O�Melveny'), a lossy conversion has already happened and the original character is unrecoverable.

  • Perhaps a global REGEXP_REPLACE(col, '�', '') would be part of an UPDATE to get rid of that. (Caution: it will take a long time if the table is large.)

  • This will find "names" encoded in UTF-8:

    WHERE HEX(name) REGEXP '^(..)*[CDEFcdef]'
    
  • This will find those that need utf8mb4, not just utf8mb3:

    WHERE HEX(name) REGEXP '^(..)*[Ff]'
    
  • é is Mojibake for é. If you see that, diagnose it with Trouble with UTF-8 characters; what I see is not what I stored

  • "Character set" refers to the encoding; "Collation" refers to sorting. Your question is only about CHARACTER SET.

  • SET NAMES latin1; immediately after connecting to the MySQL/MariaDB server will announce that the characters coming from the client (eg, your govt db) are encoded as latin1. Possibly you are running with a utf8 default.

  • If a single dataset has a mixture of encodings, all hope is lost!

  • ¬¦¡€š¬Ã…¡¬ may be "double-encoding" (see this StackOverflow link).

  • Some bad text can be "fixed", but only after identifying what caused it to be wrong. If the data came from multiple sources, there may be multiple different fixes needed. Furthermore, applying the wrong fix will only make the data much harder to fix.

  • May I suggest that you start over on that terabyte. Load each source into a separate table, then verify the encoding. If it is broken, see if a SET NAMES will fix it, and re-load (see the sketch below). This will let you isolate the different fixes. Luuk pointed out one CONVERT(...) that works for one case. I do not have an answer for A2809A; it is not valid UTF-8, nor does it look like valid text in any of the supported charsets.
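
    As a sketch of that re-load idea in PHP (mysqli; the connection details, table, and input file are invented for illustration): if a source file really is latin1, announcing latin1 on the connection lets the server transcode to utf8mb4 as it stores the data.

    <?php
    // Invented connection details, table, and input file -- adjust to your setup.
    $mysqli = new mysqli('localhost', 'user', 'pass', 'test');

    // Equivalent of SET NAMES latin1: the client promises to send latin1
    // bytes, and the server converts them to utf8mb4 as it stores them.
    $mysqli->set_charset('latin1');

    $stmt = $mysqli->prepare('INSERT INTO source1_raw (data) VALUES (?)');
    foreach (file('source1.txt', FILE_IGNORE_NEW_LINES) as $line) {
        $stmt->bind_param('s', $line);
        $stmt->execute();
    }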