Choosing character sets and collations for combined Latin / Cyrillic language data


How should I configure a MySQL database in phpMyAdmin to store both Latin and Cyrillic data in the same table, for a multi-language application?

Answer by O. Jones (best answer)

When you create your database, you can choose a default...

  • Character set to define how your characters are stored.
  • Collation to define how your characters are sorted and searched.

You give a command like this:

 CREATE DATABASE mydata CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;

phpMyAdmin has a dialog box that prompts you for those values.

(MySQL loves to brag about its Swedish roots: before version 8.0 its serverwide defaults were the latin1 character set and the latin1_swedish_ci collation. MySQL 8.0 and later default to utf8mb4, but on older servers be aware you might have to override the defaults. If I were Swedish I would brag too.)
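
If you are not sure what your server and database currently default to, you can ask. This is a minimal sketch; it assumes the database is named mydata, as in the command above.

```sql
-- Serverwide defaults, used when CREATE DATABASE names no character set.
SHOW VARIABLES LIKE 'character_set_server';
SHOW VARIABLES LIKE 'collation_server';

-- Defaults recorded for one existing database (mydata is the example name).
SELECT default_character_set_name, default_collation_name
FROM information_schema.SCHEMATA
WHERE schema_name = 'mydata';
```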

You can then override those choices for each table, or even for each column of a table, if you wish.
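
As a rough illustration of overriding at the table and column level, something like the following should work. The table and column names (articles, slug, title_es) are made up for this example.

```sql
CREATE TABLE articles (
  id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  -- Inherits the table default declared at the bottom.
  title    VARCHAR(200) NOT NULL,
  -- Column-level override: ASCII-only identifier, compared byte for byte.
  slug     VARCHAR(200) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
  -- Same character set as the table, but a language-specific collation.
  title_es VARCHAR(200) COLLATE utf8mb4_spanish_ci
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
```

Columns that declare nothing pick up the table default, and a table that declares nothing picks up the database default.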

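The character set is the most important of these choices, because the data you put into tables will be represented in that character set. If you are starting a new application, you should pick the character set utf8mb4. In any case you should pick a Unicode character set. Be aware that the character set MySQL calls utf8 is an alias for utf8mb3, which stores at most three bytes per character and so cannot hold characters outside the Basic Multilingual Plane, such as emoji; utf8mb4 can. Latin and Cyrillic text fits in either. Unicode is capable of representing almost all known natural languages with a single character set, including English, Spanish, Russian and other languages written in Cyrillic, Magyar, Hebrew, Turkish, Greek, Arabic, and East Asian languages. See the link below for a description of the various Unicode character sets.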

https://dev.mysql.com/doc/refman/5.6/en/charset-unicode-sets.html
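
To see Latin and Cyrillic text living side by side in one utf8mb4 table, a small sketch like this may help. The greetings table is hypothetical, and the example assumes your client really sends UTF-8, which SET NAMES declares to the server.

```sql
USE mydata;
SET NAMES utf8mb4;  -- tell the server the connection speaks utf8mb4

CREATE TABLE greetings (
  lang CHAR(2)      NOT NULL,
  body VARCHAR(100) NOT NULL
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

INSERT INTO greetings (lang, body)
VALUES ('en', 'Hello'), ('ru', 'Привет');

-- CHAR_LENGTH counts characters, LENGTH counts bytes.
-- 'Hello' is 5 characters in 5 bytes; 'Привет' is 6 characters in 12 bytes,
-- because each Cyrillic letter takes two bytes in UTF-8.
SELECT lang, body, CHAR_LENGTH(body) AS chars, LENGTH(body) AS bytes
FROM greetings;
```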

The collation governs how text is sorted and searched. MySQL offers many case-insensitive collations (the ones whose names end in _ci). This is really cool for natural language text, because it makes search work better: a search for "moscow" also finds "Moscow".
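
You can try the effect of a collation directly on string literals. This sketch compares the same two Cyrillic strings under a case-insensitive collation and under a binary one; it assumes the connection character set is utf8mb4.

```sql
SET NAMES utf8mb4;

SELECT
  'Москва' = 'москва' COLLATE utf8mb4_unicode_ci AS ci_match,   -- 1: case is ignored
  'Москва' = 'москва' COLLATE utf8mb4_bin        AS bin_match;  -- 0: code points differ
```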

For a new application you should pick utf8mb4_unicode_ci, or utf8_unicode_ci if you are staying with the utf8 character set. That should serve you well unless you have very specific linguistic details to deal with. (Spanish, for example, handles Ñ as a separate letter rather than an accented variant of N. To get that right you need to use the utf8mb4_spanish_ci or utf8_spanish_ci collation.)
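
The Ñ difference can be checked the same way, by comparing literals under each collation. Again a sketch, assuming a utf8mb4 connection.

```sql
SET NAMES utf8mb4;

SELECT
  'ñ' = 'n' COLLATE utf8mb4_unicode_ci AS unicode_equal,  -- 1: ñ is treated as a variant of n
  'ñ' = 'n' COLLATE utf8mb4_spanish_ci AS spanish_equal;  -- 0: ñ is its own letter
```

If only one query needs Spanish ordering, you can also leave the column collation alone and add COLLATE utf8mb4_spanish_ci to that query's ORDER BY expression.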