Fun with encodings in MySQL

Since my post timezones in MySQL turned out to be so useful (I keep checking it out every other month), I thought it would be time well spent if I jotted down some notes about another area that sends me googling every time I run into it: Encodings in MySQL.

[One] Text columns in MySQL are annotated with the encoding their data is supposed to be in. If a column doesn’t specify an encoding, a default can be given on the table, database and server levels.

MySQL will use it’s knowledge about the encoding of a column to make sure that data getting in an out is properly transcoded to whatever encoding the client is using. This is determined by variables:

  • character_set_results determines the encoding used for the bytes sent to the client.
  • character_set_client let’s the server know which encoding the bytes use that it receives from the client.
  • There’s also character_set_connection, which is described in the documentation as the encoding the server translates an incoming statement to. I’m not sure though why MySQL can’t just directly convert from the client charset to the charset used by the relevant columns, which it ultimately will do anyway. I can only imagine that this setting might be useful with respect to binary columns.

So to recap the process, the MySQL client sends a sequence of bytes to the server, the server considers those bytes to use whatever “character_set_client” is set to, will convert the data to the encoding that the target column declares to use, and will then store the result again as an encoded byte string.

[Two] To change the encoding of a complete table, you can use:

ALTER TABLE tbl_name CONVERT TO CHARACTER SET charset_name [COLLATE collation_name];
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;

Or modify a single column only:

ALTER TABLE tbl_name MODIFY column1 CHARACTER SET utf8 COLLATE utf8_unicode_ci;

Note that in both cases, MySQL will automatically convert the data stored from the old to the new encoding. This is usually what you want, except when you have one particular problem:

[Three] The data actually stored in the database uses a different encoding than the one that is declared in the column meta data, or worse, the stored data has no valid encoding at all.

You see, it’s quite easy to get MySQL to invalidly encoded data. For example, if you declare a column as “utf8”, set the character_set_client setting to “latin1”, and then send data in “utf8”, MySQL will apply a Latin1ToUtf8() transcoding function to the utf8 source data before storing it. So effectively, your text has now been encoded in utf8 twice.

Since the default for charset_set_client is usually latin1, all you have to do is pipe an UTF8-encoded SQL dump (one that doesn’t set the proper charset variables) into the mysql command line client, and you have a mess.

On the bright side, you can fix those issues directly through MySQL as well. If you have a simple encoding mismatch, it may be enough to simply change the charset declaration of the columns to match the encoding the data actually uses:

ALTER TABLE tbl_name MODIFY column1 VARCHAR(100) CHARACTER SET binary;
ALTER TABLE tbl_name MODIFY column1 VARCHAR(100) CHARACTER SET utf8;

If you naively apply a ALTER TABLE with the wanted charset, MySQL will automatically transcode the data based on what it incorrectly thinks the column’s charset is. So the trick here is to use two steps, and convert to “binary” first, which essentially amounts to “no encoding”. As a result, MySQL won’t touch the data. It simply drops the encoding annotation in the first statement, and sets a new one in the second.

If the data in the table is actually incorrectly encoded, you can fix this using CONVERT().

First, you may want to investigate what actually is wrong, i.e. what data exactly is stored as opposed to what you like to see stored. You can determine the actual bytes, avoiding any charset conversion by MySQL, using:

SELECT HEX(column1) FROM tbl_name;

Now take for example the case above. Say the column is declared as UTF8, but the UTF8 data we sent was incorrectly passed through a Latin1ToUtf8() conversion. The following statement will then reverse the effect:

UPDATE tbl_name SET
    column1=CONVERT(CONVERT(CONVERT(column1 USING latin1) USING binary) USING utf8)

CONVERT transcodes the text from whatever encoding MySQL thinks the data is currently in to the encoding given by USING. We use the same binary-trick as before. MySQL thinks the data in column1 is UTF8, so the innermost CONVERT will apply a Utf8ToLatin1() transcoding, reversing the Latin1ToUtf8() function that should never have happened in the first place. However, MySQL now thinks the result is in latin1. If we were to just save that into an UTF8 column, it would be converted back right away. So we first drop the charset annotation by switching to binary, and then we set the charset to utf8, which should now match what the data actually contains. If you wonder whether that last step can be omitted – yes, I believe so. We could just write the data returned by the second CONVERT call directly, it should have the same effect.

Update: Here is an alternative approach that does the same thing:

ALTER TABLE block MODIFY data TEXT CHARACTER SET latin1;
ALTER TABLE block MODIFY data TEXT CHARACTER SET binary;  
ALTER TABLE block MODIFY data TEXT CHARACTER SET utf8;

We convert each column individually. ALTER TABLE block CONVERT TO CHARACTER SET would be nice, but it converts text columns to things like varbinary, but then not back.

The MySQL documentation also has a bunch of info an charsets and converting.

2 thoughts on “Fun with encodings in MySQL

  1. This was an absolute LIFE SAVER – THANKS!
    I spent a sleepless night, and a frustrating morning trying to figure out how to convert an old ISO-8859-8 database to UTF-8. I kept getting what seemed to be double-encoded UTF8 data.

    Your nested convert query really saved the day!

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s