I need to debug a small but very visible problem that is hiding somewhere in the
pipeline of different pieces of code that moves the iTunes song info into these
web pages.
I'll discuss the different pieces and what I find. Any insight would be appreciated,
and you can send it to trygve (at) bombaydigital (dot) com.
The Problem
In the Now Playing section in the sidebar, the song title, artist name,
and album title are displayed. I noticed that the text was garbled for a couple of
albums that got displayed, and as you might expect it was due to non-US-English
characters. The visible examples were anything by Björk or Blue Öyster Cult. Obviously,
somewhere in the text pipeline, the umlaut-accented o is getting mangled into
the wrong thing. After a little digging around, I see that these particular characters,
the lower and upper case o with umlaut, are represented in HTML as either &ouml;
and &Ouml;, or &#246; and &#214;.
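A quick way to double-check the HTML representations of the o with umlaut is to let a library resolve them. The sketch below uses Python's standard html module purely for illustration (nothing in the pipeline itself is written in Python); the helper name resolve is made up for this example. It shows that the named and the numeric character references point at the same characters.

```python
import html
from html.entities import name2codepoint

def resolve(reference):
    """Turn an HTML character reference such as '&ouml;' into its character."""
    return html.unescape(reference)

for named, numeric in [("&ouml;", "&#246;"), ("&Ouml;", "&#214;")]:
    ch = resolve(named)
    # Both spellings of the reference name the same Unicode code point.
    assert ch == resolve(numeric)
    print(named, numeric, repr(ch), ord(ch))

# The entity table maps the bare name to the decimal code point.
print(name2codepoint["ouml"], name2codepoint["Ouml"])
```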
The text starts out in iTunes in one of the text fields, artist name in this case.
iTunes displays the character correctly.
An AppleScript gets the text object from the field in the AppleScript object model.
I need to make sure that the text extracted from iTunes gets properly encoded by
the script before it is sent to the next step. My guess is that this is the source of
the problem.
The AppleScript then invokes the Unix command curl, which makes a connection to
the server to make an HTTP request that contains the text in the parameter list.
I need to make sure that the text is properly encoded into the parameter list,
and that the request is properly formed.
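To make "properly encoded into the parameter list" concrete, here is a minimal Python sketch of what a percent-encoded query string for one song could look like. It is only an illustration: the real request is built by AppleScript and sent with curl, the parameter names title, artist and album are invented for the example, and UTF-8 is assumed as the character encoding agreed between client and server.

```python
from urllib.parse import urlencode

def build_query(title, artist, album):
    """Percent-encode the song fields as UTF-8 form parameters.

    The parameter names are hypothetical; the real script may use others.
    """
    return urlencode(
        {"title": title, "artist": artist, "album": album},
        encoding="utf-8",
    )

# Non-ASCII characters become %XX escapes of their UTF-8 bytes,
# and spaces become '+', so the request line stays pure ASCII.
print(build_query("Hyperballad", "Björk", "Post"))
```

Whatever encoding is chosen, the point is that client and server agree on it, and that every byte outside the URL-safe ASCII range is escaped.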
The Java servlet handler on the server that processes the HTTP request gets the
parameter from the request and then inserts the text into the MySQL database in columns that
are defined as strings. I need to make sure that
the SQL commands format the text in a way that is compatible with SQL and MySQL in particular.
The servlet code then queries the database for recent songs. I need to make
sure that the text that comes back out of the database is in a correct form, or is
suitably transformed.
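One common way to keep text intact on the way into and out of a database is to pass it as a bound parameter instead of pasting it into the SQL string. The sketch below shows the idea in Python with the built-in sqlite3 module standing in for MySQL; the real code is a Java servlet, where a JDBC PreparedStatement plays the same role. The table and column names are assumptions made up for this example.

```python
import sqlite3

def open_db():
    """Create an in-memory database with a hypothetical songs table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE songs ("
        "id INTEGER PRIMARY KEY, title TEXT, artist TEXT, album TEXT)"
    )
    return conn

def insert_song(conn, title, artist, album):
    # Placeholders let the driver handle quoting and encoding,
    # so apostrophes and accented letters need no manual escaping.
    conn.execute(
        "INSERT INTO songs (title, artist, album) VALUES (?, ?, ?)",
        (title, artist, album),
    )

def recent_songs(conn, limit=5):
    """Return the most recently inserted songs, newest first."""
    cur = conn.execute(
        "SELECT title, artist, album FROM songs ORDER BY id DESC LIMIT ?",
        (limit,),
    )
    return cur.fetchall()

conn = open_db()
insert_song(conn, "(Don't Fear) The Reaper", "Blue Öyster Cult", "Agents of Fortune")
print(recent_songs(conn))
```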
Finally, the servlet code writes an HTML snippet file, which is pulled into
the sidebar of various pages verbatim. I need to make sure that the snippet's
copy of the text is correctly formed for HTML.
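For the snippet step, one possible approach (an assumption for illustration, not what the servlet currently does) is to write pure ASCII: escape the HTML-special characters first, then turn every remaining non-ASCII character into a numeric character reference. That way the snippet reads correctly whatever charset the including page declares. A Python sketch, with to_html_text as a made-up helper name:

```python
import html

def to_html_text(text):
    """Return an ASCII-only, HTML-safe version of text."""
    # Escape &, <, > and quotes so the text cannot break the markup.
    escaped = html.escape(text)
    # Replace each non-ASCII character with a numeric reference like &#246;.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")

for name in ["Björk", "Blue Öyster Cult", "Simon & Garfunkel"]:
    print(to_html_text(name))
```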
Debugging
When I looked at my server logs, I could see clearly that the request parameters
were not encoded correctly, or at least not in a way suitable for eventually
appearing in the HTML page. The ö in Björk was sent as \xc3\xb6,
and the Ö in Blue Öyster Cult was sent as \xc3\x96.
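For what it's worth, those two byte pairs are exactly what UTF-8 produces for these characters, so the text in the log looks like UTF-8 rather than random garbage. The Python sketch below checks that claim; decode_logged is a hypothetical helper, and the byte values are the ones quoted from the log.

```python
def decode_logged(raw):
    """Interpret bytes copied from the server log as UTF-8 text."""
    return raw.decode("utf-8")

# Byte sequences as they appeared in the request log.
logged = {
    "Björk": b"Bj\xc3\xb6rk",
    "Blue Öyster Cult": b"Blue \xc3\x96yster Cult",
}

for expected, raw in logged.items():
    # Each two-byte pair decodes to a single accented character.
    print(raw, "->", decode_logged(raw), decode_logged(raw) == expected)
```

If that reading is right, the bytes only turn into garbage later, when some stage treats them as a single-byte encoding.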
My first inclination was to make sure that the AppleScript was encoding the text
in a reasonable way. The script that I originally found and tweaked only did a
rudimentary escaping
of characters to safely pass them to the shell. It was not really encoding anything
beyond that.
I found a few scripts on the Apple site that did some encoding, but it looked
to me like they just encoded everything besides A-Z and 0-9 into percent-sign-escaped
hex values. Was that enough to do the right thing? Probably not.
With the addition of the encoding provided by the function found on the Apple site,
the ö was encoded as %9A, and the Ö was encoded as %85.
Sure enough, these are the code points of these characters in the Mac Roman character
set. You can see all the Mac Roman code points in the ASCII viewer window of my trusty
utility application Hex Wrench.
(It's a Classic app that runs perfectly OK in OS X under Classic.) Clearly this is
not sufficient for exporting to the outside world in an HTTP request!

(Mac Roman ASCII code points, grid courtesy of Hex Wrench)
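The Mac Roman values can be reproduced without Hex Wrench. This Python sketch uses the standard mac_roman codec and urllib's percent-encoding to show where %9A and %85 come from; it is an illustration only, and the function name percent_encode_mac_roman is invented for the example.

```python
from urllib.parse import quote

def percent_encode_mac_roman(text):
    """Percent-encode text using its Mac Roman byte values."""
    return quote(text, safe="", encoding="mac_roman")

for ch in ["ö", "Ö"]:
    raw = ch.encode("mac_roman")
    # One byte per character, taken from the Mac Roman table.
    print(repr(ch), hex(raw[0]), percent_encode_mac_roman(ch))
```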
But surprisingly, this worked in my initial test, and not just when viewing it on a Mac, but also on Windows.
The hex-encoded value actually worked better than the raw value, despite being the
Mac Roman encoding. I tested viewing
the page from browsers on both platforms. Viewing the HTML snippet, what appeared there was
in fact not encoded, but simply the desired character. My guess was
that this was just pure dumb luck: the servlet is writing the Mac Roman byte value for the
character, and the server is running on the Mac and serving the page up correctly.
But what happens if the server is not a Mac?
As expected, it didn't work quite the same on the Unix host. Previously, the value
\xc3\xb6 in the HTTP request parameter became two bytes in the database
field, displaying a capital A with tilde and a paragraph symbol (Ã¶). Now, the single
byte value $9A was displaying as a single character: an s with the upside-down circumflex
accent (š, an s with caron). So clearly a simple hex encoding of the Mac Roman code point is no good.
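Both kinds of garbage can be reproduced by decoding the bytes with the wrong single-byte encoding. The sketch below is a guess at what happened, not a confirmed diagnosis: it assumes the two-byte value was read as ISO-8859-1 and the single byte 0x9A was rendered as Windows-1252 (in ISO-8859-1 proper, 0x9A is a control character, but browsers commonly treat the two as the same).

```python
def misread(raw, wrong_encoding):
    """Decode bytes with an encoding they were not written in."""
    return raw.decode(wrong_encoding)

# UTF-8 bytes for o-umlaut, read one byte at a time as ISO-8859-1.
two_bytes = misread(b"\xc3\xb6", "iso-8859-1")

# Mac Roman byte for o-umlaut, read as Windows-1252.
one_byte = misread(b"\x9a", "cp1252")

print(repr(two_bytes), [hex(ord(c)) for c in two_bytes])
print(repr(one_byte), hex(ord(one_byte)))
```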
I think my next step is to add a real ISO-8859-1 encoding function to the AppleScript. It
can just do a simple mapping from Mac Roman to ISO-8859-1. This
is probably better than putting it in the server code, even though I could do it in
a few minutes in Java and will instead have to muck around in the unfamiliar AppleScript syntax.
The encoding issue is a client thing: the problem is the mapping from Mac Roman to a
common encoding, and that should be done where the data originates.
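The mapping itself is small. Here is a sketch of the Mac Roman to ISO-8859-1 conversion in Python, just to pin down the expected byte values; the real function would have to be written in AppleScript, and the name mac_roman_to_latin1 is made up. Characters that exist in Mac Roman but not in ISO-8859-1 (the dagger at 0xA0, for example) have no target byte, so this sketch replaces them with a question mark, which is one possible policy among several.

```python
def mac_roman_to_latin1(raw):
    """Re-encode Mac Roman bytes as ISO-8859-1 bytes.

    Characters with no ISO-8859-1 equivalent become b'?'.
    """
    text = raw.decode("mac_roman")
    return text.encode("iso-8859-1", errors="replace")

for raw in [b"Bj\x9ark", b"Blue \x85yster Cult", b"\xa0"]:
    print(raw, "->", mac_roman_to_latin1(raw))
```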
One small hitch I can think of is that the ampersand itself will have to be encoded
since it is going into a URL. As long as the Java parameter getter decodes the ampersand
in the parameters, it will be transparent on that side; otherwise, I'll have to make
sure to decode it on the server side so that it reappears as an ampersand there.
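The ampersand round trip can be sketched in Python as well. Here quote stands in for the client-side escaping and parse_qs for the server's parameter parsing; both are stand-ins for illustration, and the parameter name artist is hypothetical. On the Java side, the servlet API's getParameter normally does the percent-decoding, so the ampersand should come back on its own.

```python
from urllib.parse import quote, parse_qs

def encode_param(name, value):
    """Build name=value with every reserved character in value escaped."""
    return name + "=" + quote(value, safe="")

query = encode_param("artist", "Simon & Garfunkel")
print(query)  # artist=Simon%20%26%20Garfunkel

# The server-side parser turns %26 back into a literal ampersand.
print(parse_qs(query))

# Without escaping, the '&' is taken as a parameter separator
# and the value is cut short.
print(parse_qs("artist=Simon & Garfunkel"))
```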