Arena Red » 20 Oct 2003 » Character Encoding Resolved
Character Encoding Resolved

Following up on this morning's post.

It took a bit of experimentation to see just what concoction of encoding would flow all the way from iTunes to AppleScript to curl to Tomcat to MySQL and back and finally to the HTML text file for rendering in a browser.

The text extracted from iTunes in AppleScript starts out as Mac Roman code points, which agree with ASCII only for values below 128. This text has to be escaped and encoded in a couple of ways in order to flow the rest of the way correctly.

First, the obvious punctuation that can interfere with the shell and with URL construction has to be escaped in the percent+hex form. For example, & must be escaped as %26. These basic punctuation characters don't need a more complex escape sequence because they are in the core ASCII code set, which does not conflict with anything in the chain.
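As a rough illustration of this first step, here is a minimal Python sketch. It is not the AppleScript that was actually used; the helper name `escape_punctuation` and the particular set of characters in `RISKY` are assumptions made for the example.

```python
# Hypothetical helper: percent-escape ASCII punctuation that could confuse
# the shell or break the URL's parameter list.
# The exact character set below is an assumption, not the original script's list.
RISKY = "&#%+=?\"' "

def escape_punctuation(text):
    out = []
    for ch in text:
        if ch in RISKY:
            # Percent sign followed by the two-digit uppercase hex value.
            out.append("%{:02X}".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)

print(escape_punctuation("Simon & Garfunkel"))  # Simon%20%26%20Garfunkel
```

The percent sign itself is in the risky set so that a literal % in a track name cannot be mistaken for the start of an escape sequence.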

Next, the characters at 128 and above need to be mapped to ISO 8859-1 values and escaped in the ampersand-pound-decimal-semicolon form. For example, ö must be escaped as &#246;. Actually, instead of the #246 decimal value, we could use the symbol's name ouml, yielding the escape sequence &ouml; (the named form takes no pound sign). But the code I found used the decimal values, so I didn't have to research the names of all those glyphs.
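The mapping step might look something like the following Python sketch. The function name `to_numeric_references` is hypothetical, and it assumes the input arrives as raw Mac Roman bytes. Python's `mac_roman` codec does the Mac Roman lookup, and for characters that exist in ISO 8859-1 the resulting code point is the same number as the ISO 8859-1 value.

```python
# Hypothetical helper: turn Mac Roman bytes into ASCII text in which every
# character at 128 or above is written as a decimal numeric reference.
def to_numeric_references(mac_bytes):
    text = mac_bytes.decode("mac_roman")
    out = []
    for ch in text:
        code_point = ord(ch)
        if code_point > 127:
            # ampersand, pound, decimal value, semicolon
            out.append("&#{};".format(code_point))
        else:
            out.append(ch)
    return "".join(out)

# 0x9A is ö in Mac Roman; its ISO 8859-1 value is 246.
print(to_numeric_references(b"Mot\x9arhead"))  # Mot&#246;rhead
```

Mac Roman characters that have no ISO 8859-1 equivalent, such as curly quotes, come out of this sketch with values above 255, which browsers also accept as numeric references.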

But the ISO escape sequences can't simply be of the form &#246;. The problem is that the ampersand appears in the URL, where it separates parameters, so the server will not get the intended parameters. So we must form the ISO escape sequences by hex encoding the punctuation inside them. The &#246; actually gets represented as %26%23246; in order to be properly "buried" in the parameter list. Because the servlet engine knows how to decode these hex values, we get the desired character string &#246; placed in the database. Once stored correctly in the database, it ends up correct in the HTML, where the browser renders it as ö.
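To see the round trip end to end, here is a small Python sketch. `bury_reference` is a hypothetical name, `urllib.parse.unquote` stands in for the parameter decoding that the servlet engine performs, and `html.unescape` stands in for the browser interpreting the stored text.

```python
import html
from urllib.parse import unquote

# Hypothetical helper: hex-encode the & and # of a numeric reference so the
# reference can travel inside a URL parameter without splitting it.
def bury_reference(ref):
    return ref.replace("&", "%26").replace("#", "%23")

buried = bury_reference("&#246;")
print(buried)                 # %26%23246;

# Stand-in for the servlet engine decoding the parameter value.
stored = unquote(buried)
print(stored)                 # &#246;

# Stand-in for the browser rendering the stored text.
print(html.unescape(stored))  # ö
```

The semicolon is left unescaped here, matching the %26%23246; form described above.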
