Please help me understand how multibyte characters like emoji's are handled in MySQL utf8mb4 fields.
See below for a simple test SQL to illustrate the challenges.
/* Clear Previous Test */
DROP TABLE IF EXISTS `emoji_test`;
DROP TABLE IF EXISTS `emoji_test_with_unique_key`;
/* Build Schema */
CREATE TABLE `emoji_test` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`string` varchar(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '',
`status` tinyint(1) NOT NULL DEFAULT '1',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
CREATE TABLE `emoji_test_with_unique_key` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`string` varchar(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '',
`status` tinyint(1) NOT NULL DEFAULT '1',
PRIMARY KEY (`id`),
UNIQUE KEY `idx_string_status` (`string`,`status`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
/* INSERT data */
# Expected Result is successful insert for each of these.
# However some fail. See comments.
INSERT INTO emoji_test (`string`, `status`) VALUES ('', 1); # SUCCESS
INSERT INTO emoji_test (`string`, `status`) VALUES ('', 1); # SUCCESS
INSERT INTO emoji_test (`string`, `status`) VALUES ('', 1); # SUCCESS
INSERT INTO emoji_test (`string`, `status`) VALUES ('', 1); # SUCCESS
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('', 1); # SUCCESS
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('', 1); # FAIL: Duplicate entry '?-1' for key 'idx_string_status'
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('', 1); # SUCCESS
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('', 1); # FAIL: Duplicate entry '??-1' for key 'idx_string_status'
/* Test data */
/* Simple Table */
SELECT * FROM emoji_test WHERE `string` IN ('','','',''); # SUCCESS (all 4 are found)
SELECT * FROM emoji_test WHERE `string` IN (''); # FAIL: Returns both and
SELECT * FROM emoji_test WHERE `string` IN (''); # FAIL: Returns both and
SELECT * FROM emoji_test; # SUCCESS (all 4 are found)
/* Table with Unique Key */
SELECT * FROM emoji_test_with_unique_key WHERE `string` IN ('','','',''); # FAIL: Only 2 are found (due to insert errors above)
SELECT * FROM emoji_test_with_unique_key WHERE `string` IN (''); # SUCCESS
SELECT * FROM emoji_test_with_unique_key WHERE `string` IN (''); # FAIL: found instead of
SELECT * FROM emoji_test_with_unique_key; # FAIL: Only 2 records found ( and )
I'm interested in learning what causes the FAIL
s above and how I can get around this.
Specifically:
- Why do selects for one multibyte character return results for any multibyte character?
- How can I configure an index to handle multibyte characters instead of
?
? - Can you recommend changes to the second
CREATE TABLE
(the one with a unique key) above in such a way that makes all the test queries return successfully?
You use
utf8mb4_unicode_ci
for your columns, so the check is case insensitive. If you useutf8mb4_bin
instead, then the emoji and are correctly identified as different letters.With
WEIGHT_STRING
you can get the values that are use for sorting and comparison for the input string.If you write:
Then you can see that both are
0xfffd
. In Unicode Character Sets they say:If you write:
You will get their unicode values
0x01f32e
and0x01f336
instead.For other letters like
Ä
,Á
andA
that are equal if you useutf8mb4_unicode_ci
, the difference can be seen in:Those map to to the weight
0x0E33
According to : Difference between utf8mb4_unicode_ci and utf8mb4_unicode_520_ci collations in MariaDB/MySQL? the weights used for
utf8mb4_unicode_ci
are based on UCA 4.0.0 because the emoji do not appear in there, the mapped weight is0xfffd
If you need case insensitive compares and sorts for regular letters along with emoji then this problem is solved using
utf8mb4_unicode_520_ci
:there will also get different weights for those emoji
0xfbc3f32e
and0xfbc3f336
.