MySQL utf8 vs. utf8mb4 – What’s the difference? (eversql.com)

 

MySQL has caused so much pain with their decision to make latin1 the default encoding.I’ve seen so many PHP apps that store UTF-8 encoded text in Latin1 fields, and it kind of works, until you correctly configure your client library, or try sorting the data, or check for equality…

The problem is so common, that the popular MySQL client app Sequel Pro even has an option to set the encoding to “UTF8 via Latin1”, which tells the server to output Latin1 but then treats the result as UTF-8.

So once you learned this the hard way, you’ll change the db encoding to utf8, and you’ll be glad to have finally fixed the encoding mess, only to stumble across the fact from the original article that utf8 is not really utf8 in MySQL — oh the horror!

Use utf8mb4. Use utf8mb4_unicode_520_ci. Alter tables usingALTER TABLE database_name.table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci.

Verify that all of your columns are utf8mb4 with:

SELECT table_schema, table_name, character_set_name, collation_name FROM information_schema.columns WHERE table_schema NOT IN (‘mysql’,’information_schema’,’sys’,’performance_schema’) AND character_set_name is not NULL AND character_set_name != ‘utf8mb4’;

You may have to update your client libraries (like Perl-DBD) if you’re still using something like CentOS 6.

If only they had named them utf8mb3 and utf8. Would have been a lot clearer that one was optimized, but not fully utf8.

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

How UTF-8 works

This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don’t even notice anything wrong. Only the rest of the world has to jump through hoops.

.. UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you it can still squeeze through unscathed.

.. If there’s no equivalent for the Unicode code point you’re trying to represent in the encoding you’re trying to represent it in, you usually get a little question mark: ? or, if you’re reallygood, a box. Which did you get? -> �

.. There are hundreds of traditional encodings which can only store somecode points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252

..  ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks.

..  It does not make sense to have a string without knowing what encoding it uses.

.. If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

.. they write something that doesn’t exactly conform to the letter-frequency-distribution of their native language, and Internet Explorer decides it’s Korean and displays it thusly, proving, I think, the point that Postel’s Law about being “conservative in what you emit and liberal in what you accept” is quite frankly not a good engineering principle.