Blog

Legacy databases and Special Characters

Scott's picture

Scott

Nov 18, 2009

I can't count the number of times that I have had to deal with the "special character" problem. You know the problem--you're reading along on some Web site, perhaps even a Web site that you are managing, and you find words with weird characters--maybe Euro signs, accented upper and lower case characters and whatnot--interspersed with the normal text. Or perhaps the whole page is simply gibberish. Or maybe there is just a stray question mark in the middle of words here and there.

These are symptoms of an encoding problem. Sometimes the problem is as simple as a server serving up a page in one encoding as if it were in another encoding. But sometimes the problem has to do with how content is stored in or gotten out of a database. It is this problem that I want to look at here.

It wasn't that long ago that mysql, one of the most popular databases among open-source aficionados, did not support UTF-8, the encoding of choice for internationalization. To work around this, many coders simply declared the encoding of the database to be ISO-8859-1 (or didn't declare it at all: ISO-8859-1 is the default) and tried to handle encoding and decoding of data in their interface code. This sometimes worked passably well (though SQL's character string-handling functions--finding string lengths, sorting, pattern matching--won't work quite correctly). But it is hard to get this character-handling code right, and such programs often turn out to have subtle bugs. Worse yet, attempts to fix those bugs often end up leaving the data in an inconsistent state. For example, older entries in some table column may be encoded in one way, and more recent entries (made after a supposed bug fix) may be encoded another way.

To address this problem, I imagine a new database-analysis tool. (Perhaps such a tool already exists, but I wouldn't know how to find it if it does.) It would work something as follows: You would give it a database, a set of tables in a database, or maybe a set of textual fields in tables of a database. It would then analyze all of the textual fields given and try to figure out what encoding is used by each. In ambiguous cases, it might display a given value in various encodings and let the user figure out which is the correct one. If a single field seems to contain data that assumes different encodings, it would analyze the number of entries that assume one encoding versus another. (For example, it might tell the user that every row with an id > 2345 has a different encoding than the earlier entries.) Perhaps then it would even make recommendations about converting the existing data to a recommended encoding and describe the steps needed to do this, or even do it automatically upon request.

I have already done some initial experiments with writing such software. It seems that the most difficult part is not so much the mechanics of the code, but deciding exactly what the software should do. Any suggestions in this regard would be welcome. And if the software I am envisioning already exists, let me know that too.

Mojibake

Matt

Dec 09, 2009

I learned a couple things this week as I was trying to move data from a latin1-encoded database to a utf8-encoded database:

  1. There's a term for the garbage you see when characters aren't displayed with the proper encoding: mojibake
  2. There's no one bulletproof way to fix it

I finally found a solution that worked for me but I'd come very close to just giving up and correcting all the mojibake via search & replace. I started making a table of all the garbled strings in the database and the correct strings with which to replace them, but shouldn't I be able to create that list automatically? Won't an encoding mismatch mangle the text in predicable ways? Once I identify a piece of mojibake in my data, shouldn't I be able to tell what the proper data should be without the aid of a translation table?

I looked around for a pre-built solution on Google, but didn't come up with anything. I ended up fixing my specific problem by altering the encoding values on the SQL dump file. But this kind of thing comes up a lot, and a function that could fix mojibake on the fly is sure to come in handy.