Tech Note: Character Encoding Bug Hunt
An open thread for a Friday morning. I’m chasing down a long-running, very annoying character translation bug that causes Western European characters (with accènts, ümlauts, etc.) to show up as garbage when the Ajax system transfers them back and forth from the server.
I’m pretty sure I’ve finally killed the bug, but we can test my solution to destruction in this thread.
UPDATE at 5/30/08 12:11:33 pm:
Our plan for world domination is coming together, and the character encoding part of it now works very well. After trying a million or more different approaches, and only getting halfway there, the real solution involved a PHP function containing only 5 lines of code:
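function convertLatin1ToHtml($str) {
    // Note: on PHP 5.4+ these tables default to UTF-8; pass 'ISO-8859-1' as a third argument for Latin-1 input.
    // Every character PHP can express as a named entity (quotes excluded).
    $allEntities = get_html_translation_table( HTML_ENTITIES, ENT_NOQUOTES );
    // Only the markup-significant characters: &, < and >.
    $specialEntities = get_html_translation_table( HTML_SPECIALCHARS, ENT_NOQUOTES );
    // Drop the markup characters so existing HTML tags pass through untouched.
    $noTags = array_diff($allEntities, $specialEntities);
    // Replace each remaining extended character with its HTML entity.
    $str = strtr($str, $noTags);
    return $str;
}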
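The source of the encoding problem is the way JavaScript mishandles raw (unencoded) European characters in text returned from the server, most likely because the browser decodes an Ajax response as UTF-8 unless the server declares another charset.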
There’s no problem when JavaScript reads the text field and sends the text to the server. Before sending the text you’ll typically use a JavaScript function like escape (if your pages are served as ISO-8859-1) or encodeURIComponent (if you’re serving UTF-8), and both of these functions correctly encode the extended characters so that PHP can translate them back.
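To make the difference between the two functions concrete, here is a minimal sketch of what each one produces for a single extended character. The sample string is a made-up input, not anything from the LGF code.

```javascript
// Hypothetical sample input containing one Latin-1 extended character.
const text = "Müller";

// escape() is a legacy function: a code point below 256 becomes a single
// %XX byte, which lines up with ISO-8859-1.
const latin1Encoded = escape(text);

// encodeURIComponent() always percent-encodes the UTF-8 bytes of the string,
// so "ü" becomes two bytes.
const utf8Encoded = encodeURIComponent(text);

console.log(latin1Encoded); // M%FCller
console.log(utf8Encoded);   // M%C3%BCller
```

On the PHP side, the first form decodes to a Latin-1 byte and the second to a UTF-8 byte pair, which is why the choice has to match how the pages are served.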
The problem occurs on the return trip. If the PHP script sends back any raw extended characters, JavaScript has a tantrum, dumps out a bunch of garbage, and embarrasses itself in front of the whole internet.
The solution: any text that may contain European characters and will be returned to a JavaScript routine (for example, via XMLHttpRequest) needs to be passed through the function above to properly encode the extended characters as HTML entities (for example, “&uuml;”).
This function exists because we can’t simply encode the whole text with a call to htmlentities. In comments and LGF articles, the text may contain HTML tags—and we don’t want those to be encoded or they’ll display as text in the browser instead of acting as HTML.
So the function above gets the two PHP translation tables (for htmlentities and htmlspecialchars) and calls array_diff to generate a translation table that omits the HTML-specific characters: &, <, and >. Single and double quotes are already left out of both tables by ENT_NOQUOTES. Then it simply calls strtr (string translate) to replace those pesky foreign characters with their equivalent HTML entities, leaving the HTML tags and anything inside them alone.
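The table-difference idea is not specific to PHP. As a rough illustration only, here is the same technique sketched in JavaScript with two tiny, hand-written tables. The table contents and the names tableDiff and convertToEntities are hypothetical; PHP's real tables are far larger, and this is an analogue of the function above, not a port of it.

```javascript
// Hypothetical miniature versions of the two translation tables.
const allEntities = {
  "&": "&amp;", "<": "&lt;", ">": "&gt;",
  "ü": "&uuml;", "è": "&egrave;", "é": "&eacute;"
};
const specialEntities = { "&": "&amp;", "<": "&lt;", ">": "&gt;" };

// Like PHP's array_diff: keep entries whose value does not appear
// among the values of the second table.
function tableDiff(all, special) {
  const specialValues = new Set(Object.values(special));
  const result = {};
  for (const [ch, entity] of Object.entries(all)) {
    if (!specialValues.has(entity)) result[ch] = entity;
  }
  return result;
}

// Like strtr with a table: replace each character that has an entry,
// and leave everything else (including tags) alone.
function convertToEntities(str, table) {
  return Array.from(str, (ch) =>
    Object.prototype.hasOwnProperty.call(table, ch) ? table[ch] : ch
  ).join("");
}

const noTags = tableDiff(allEntities, specialEntities);
const converted = convertToEntities("<b>Müller è</b> & co", noTags);
console.log(converted); // <b>M&uuml;ller &egrave;</b> & co
```

The tags and the ampersand come through unchanged because their entries were removed from the table before the replacement pass.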
And now we have a nice, safe, JavaScript-friendly string that can be passed back to any browser and displayed correctly, without fear of embarrassment.
(Note: as usual, there’s a caveat with Internet Explorer—some HTML entities are not supported in IE by default, and may display as little boxes.)