Migrating MySQL Data to Unicode

Two and half years ago when I first starting working here, I quickly realized that one of the issues we would face in the future, would be lack of Unicode support; however, all of our clients are state-side and the “nothing exists outside of the US/English” mentality was working fine for us.

Two and a half years later, we now have several world-wide clients, and the time has finally come for us to support unicode; our biggest hurdle, is the 7 years of legacy Latin1 (ISO-8859-1) data.

Well, now I had to bite the bullet, I was thinking to write a script to change the table to UTF-8, and be done… but it’s not that simple. MySQL does NOT convert the data when you change the table charset; so then I decided to grab all the data from the database, throw it through iconv() and back into the database.

Unfortunately, with this approach, we were looking at some serious downtime, something I’d like to avoid. So when discussing this in #phpc, abraxas informed me that if you create a second database, with the same schema but using UTF-8, and then do an “INSERT… SELECT” between the two, MySQL would convert the data for you.

This is all well and good, but we can’t change the database name, and as there is no (to my knowledge) easy way to rename a database using SQL, I decided to take it to the table level. So by following the procedure below, we can convert our data to Unicode without any issues:

  • Grab the databases
  • Check to make sure it’s an application database
  • Grab the tables
  • Grab the table schema
  • Create a duplicate table, with _utf8 tacked on the name, using unicode charset
  • INSERT … SELECT the data
  • Delete the old table
  • Rename the new one
  • Rinse and repeat

The following is a rough and ready script that I whipped up in like… 2 minutes, and at 69 lines will convert all database to unicode

Other than this, just run the following on every DB connection:

and add the following to your Apache conf:

Other than this, there are some minor areas we need to tidy up (for example, queries containing “Å” fetch “A”, but not vice-versa) but at least we have no issues supporting input and output of all unicode data and it took less than a few hours to complete.