20 Oct 2003
Character Encoding Resolved

Following up on this morning's post.

It took a bit of experimentation to see just what concoction of encoding would flow all the way from iTunes to AppleScript to curl to Tomcat to MySQL and back and finally to the HTML text file for rendering in a browser.

The text extracted from iTunes in AppleScript starts out as Mac Roman code points. This text has to be escaped and encoded in a couple of ways in order to flow the rest of the way correctly.

First, the obvious punctuation that can interfere with the shell and with URL construction has to be escaped in the percent+hex form. For example, & must be escaped as %26. These basic punctuation characters don't need a more complex escape sequence because they are in the core ASCII code set, which does not conflict with anything in the chain.
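As a rough sketch of this first step, here is what the percent+hex escaping looks like in Python. This is not the AppleScript used on the site, and the function name escape_punctuation is made up for illustration.

```python
from urllib.parse import quote

def escape_punctuation(text: str) -> str:
    """Percent-escape characters that are unsafe in a URL parameter."""
    # safe="" escapes even "/"; letters, digits and a few marks
    # such as "-" and "." are left alone.
    return quote(text, safe="")

print(escape_punctuation("AT&T"))  # AT%26T
```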

Next, the characters at 128 and above need to be mapped to ISO 8859-1 values and escaped in the ampersand-pound-decimal-semicolon form. For example, ö must be escaped as &#246;. Actually, instead of the #246 decimal value, we could use the symbol's name ouml, yielding the escape sequence &ouml; (the pound sign is dropped when the name is used). But the code I found used the decimal values, so I didn't have to research the names of all those glyphs.
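To make the decimal form concrete, here is a minimal Python sketch (again, not the AppleScript in use; to_numeric_references is a hypothetical name). It relies on the fact that for characters in the ISO 8859-1 range, the Unicode code point is the same number as the ISO 8859-1 value.

```python
def to_numeric_references(text: str) -> str:
    """Replace every non-ASCII character with an &#NNN; decimal reference."""
    # "xmlcharrefreplace" substitutes the decimal code point for any
    # character the ASCII codec cannot represent.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

# "\u00f6" is the o with umlaut, as in the artist name.
print(to_numeric_references("Bj\u00f6rk"))  # Bj&#246;rk
```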

But the ISO escape sequences can't simply be of the form &#246;. The problem is that the ampersand symbol appears in the URL and will mess it up, such that the server will not get the intended parameters. So we must form the ISO escape sequences by hex encoding the punctuation inside them. So the &#246; actually gets represented as %26%23246; in order to be properly "buried" in the parameter list. Because the servlet engine knows how to decode these hex values, we get the desired character string &#246; placed in the database. Once stored correctly in the database, it ends up correct in the HTML.
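Putting the two steps together, the round trip can be sketched in Python as below. The name encode_for_query is hypothetical, and unquote only stands in for whatever decoding the servlet engine performs on request parameters.

```python
from urllib.parse import quote, unquote

def encode_for_query(text: str) -> str:
    """Numeric-reference the non-ASCII characters, then percent-escape the result."""
    referenced = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
    # safe=";" leaves the trailing semicolon unescaped, matching the
    # %26%23246; form described in the post.
    return quote(referenced, safe=";")

encoded = encode_for_query("Bj\u00f6rk")
print(encoded)           # Bj%26%23246;rk
print(unquote(encoded))  # Bj&#246;rk
```

Escaping the semicolon as %3B would also be reasonable, since some servers treat a bare semicolon as a parameter separator.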

Debugging the Text Pipeline

I need to debug a small but very visible problem that is hiding somewhere in the pipeline of different pieces of code that moves the iTunes song info into these web pages. I'll discuss the different pieces and what I find. Any insight would be appreciated, and you can send it to trygve (at) bombaydigital (dot) com.

The Problem

In the Now Playing section in the sidebar, the song title, artist name, and album title are displayed. I noticed that the text was garbled for a couple of albums that got displayed, and as you might expect it was due to non-US-English characters. The visible examples were anything by Björk or Blue Öyster Cult. Obviously, somewhere in the text pipeline, the umlaut-accented o is getting mangled into the wrong thing. After a little digging around, I see that these particular characters, the lower and upper case o with umlaut, are represented in HTML as either &ouml; and &Ouml;, or &#246; and &#214;.

The text starts out in iTunes in one of the text fields, artist name in this case. iTunes displays the character correctly.

An AppleScript gets the text object from the field in the AppleScript object model. I need to make sure that the text extracted from iTunes gets properly encoded by the script before it is sent to the next step. My guess is that this is the source of the problem.

The AppleScript then invokes the Unix command curl, which makes a connection to the server to make an HTTP request that contains the text in the parameter list. I need to make sure that the text is properly encoded into the parameter list, and that the request is properly formed.

The Java servlet handler code on the server that processes the HTTP request gets the parameter from the request and then inserts the text into the MySQL database in columns that are defined as strings. I need to make sure that the SQL commands format the text in a way that is compatible with SQL, and with MySQL in particular.

The servlet code then queries the database for recent songs. I need to make sure that the text that comes back out of the database is in a correct form, or is suitably transformed.

Finally, the servlet code writes an HTML snippet file, which is pulled into the sidebar of various pages verbatim. I need to make sure that the snippet's copy of the text is correctly formed for HTML.

Debugging

When I looked at my server logs, I could see clearly that the request parameters were not encoded correctly, or at least not in a way suitable for eventually appearing in the HTML page. The ö in Björk was sent as \xc3\xb6, and the Ö in Blue Öyster Cult was sent as \xc3\x96.
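One reading of those log entries, offered as an assumption rather than something the logs state, is that they are the UTF-8 byte sequences for the two characters. A quick Python check shows that UTF-8 produces exactly those bytes:

```python
# "\u00f6" is the lower case o with umlaut, "\u00d6" the upper case one.
utf8_bytes = {
    "o-umlaut": "\u00f6".encode("utf-8"),
    "O-umlaut": "\u00d6".encode("utf-8"),
}
print(utf8_bytes)  # {'o-umlaut': b'\xc3\xb6', 'O-umlaut': b'\xc3\x96'}
```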

My first inclination was to make sure that the AppleScript was encoding the text in a reasonable way. The script that I originally found and tweaked only did a rudimentary escaping of characters to safely pass them to the shell. It was not really encoding anything beyond that. I found a few scripts on the Apple site that did some encoding stuff, but it looked to me like it just encoded stuff besides A-Z and 0-9 into percent-sign-escaped hex values. Was that enough to do the right thing? Probably not.

With the addition of the encoding provided by the function found on the Apple site, the ö was encoded as %9A, and the Ö was encoded as %85. Sure enough, these are the code points of these characters in the Mac Roman character set. You can see all the Mac Roman code points in the ASCII viewer window of my trusty utility application Hex Wrench. (It's a Classic app that runs perfectly OK in OS X under Classic.) Clearly this is not sufficient for exporting to the outside world in an HTTP request!
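The same values can be reproduced without Hex Wrench. This Python sketch percent-escapes each character using its Mac Roman byte. The helper name mac_roman_percent is hypothetical, and the Apple-provided AppleScript function presumably works differently inside.

```python
from urllib.parse import quote

def mac_roman_percent(ch: str) -> str:
    """Percent-escape a character using its Mac Roman byte value."""
    return quote(ch, safe="", encoding="mac_roman")

print(mac_roman_percent("\u00f6"))  # %9A  (lower case o with umlaut)
print(mac_roman_percent("\u00d6"))  # %85  (upper case O with umlaut)
```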

(Figure: Mac Roman code points, grid courtesy of Hex Wrench)

But surprisingly, this worked in my initial test, and not just viewing it on Mac, but on Windows. The hex-encoded value actually worked better than the raw value, despite being the Mac Roman encoding. I tested viewing the page from browsers on both platforms. Viewing the HTML snippet, what appeared there was in fact not encoded, but simply the desired character. My guess was that this was just pure dumb luck: it's writing the Mac Roman byte value for the character, and the server is running on the Mac and serving the page up correctly. But what happens if the server is not a Mac?

As expected, it didn't work quite the same on the Unix host. Previously, the value \xc3\xb6 in the HTTP request parameter became two bytes in the database field, displaying a capital A with tilde and a Paragraph symbol. Now, the single byte value $9A was displaying as a single character s with the upside down circumflex accent. So clearly a simple hex encoding of the Mac Roman code point is no good.
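Both garbled results can be reproduced by decoding the bytes with the wrong character set. The sketch below assumes the two-byte case was read as ISO 8859-1 and the single byte as Windows-1252, which many browsers substitute for ISO 8859-1. Neither assumption is confirmed by anything I logged.

```python
import unicodedata

# The two-byte sequence read one byte at a time as ISO 8859-1.
two_chars = b"\xc3\xb6".decode("iso-8859-1")
# The Mac Roman byte for the o with umlaut read as Windows-1252.
one_char = b"\x9a".decode("cp1252")

names = [unicodedata.name(c) for c in two_chars + one_char]
print(names)
# ['LATIN CAPITAL LETTER A WITH TILDE', 'PILCROW SIGN',
#  'LATIN SMALL LETTER S WITH CARON']
```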

I think my next step is to add a real iso-8859-1 encoding function in the AppleScript. It can just do a simple mapping from Mac Roman to iso-8859-1. This is probably better than putting it in the server code, even though I could do that in a few minutes in Java and will instead have to muck around in the unfamiliar AppleScript syntax. The encoding issue is a client thing: the problem is the mapping from Mac Roman to a common encoding, and that should be done where the data originates. One small hitch I can think of is that the ampersand itself will have to be encoded since it is going into a URL. As long as the Java parameter getter decodes the ampersand in the parameters, it will be transparent on that side; otherwise, I'll have to make sure to decode it on the server side so that it reappears as an ampersand there.
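For reference, the mapping itself is tiny in a language with built-in codecs. This Python sketch (mac_roman_to_latin1 is a hypothetical name, not the planned AppleScript) converts Mac Roman bytes to their iso-8859-1 equivalents:

```python
def mac_roman_to_latin1(data: bytes) -> bytes:
    """Re-encode Mac Roman bytes as ISO 8859-1 bytes."""
    # Raises UnicodeEncodeError for Mac Roman characters that have no
    # ISO 8859-1 equivalent, such as the trademark sign.
    return data.decode("mac_roman").encode("iso-8859-1")

print(mac_roman_to_latin1(b"Bj\x9ark"))  # b'Bj\xf6rk'
```

The byte 0xF6 is decimal 246, the ISO 8859-1 value for the lower case o with umlaut.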