Tech Note: Character Encoding Bug Hunt


An open thread for a Friday morning. I’m chasing down a long-running, very annoying character translation bug that causes Western European characters (with accènts, ümlauts, etc.) to show up as garbage when the Ajax system transfers them back and forth from the server.

I’m pretty sure I’ve finally killed the bug, but we can test my solution to destruction in this thread.

UPDATE at 5/30/08 12:11:33 pm:

Our plan for world domination is coming together, and the character encoding part of it now works very well. After trying a million or more different approaches, and only getting halfway there, the real solution involved a PHP function containing only 5 lines of code:

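function convertLatin1ToHtml($str) {
    // Every character PHP has a named HTML entity for (quotes excluded).
    $allEntities = get_html_translation_table( HTML_ENTITIES, ENT_NOQUOTES );
    // Only the markup characters: &, < and >.
    $specialEntities = get_html_translation_table( HTML_SPECIALCHARS, ENT_NOQUOTES );
    // Drop the markup characters so HTML tags pass through untouched.
    $noTags = array_diff($allEntities, $specialEntities);
    // Replace each remaining extended character with its entity.
    $str = strtr($str, $noTags);
    return $str;
}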

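The encoding problem shows up when Javascript displays raw (unencoded) European characters that arrive in a server response. The most likely mechanism is a charset mismatch: XMLHttpRequest decodes response text as UTF-8 unless the Content-Type header names another charset, so raw ISO-8859-1 bytes come out as garbage.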

There’s no problem when Javascript reads the text field and sends the text to the server. Before sending the text you’ll typically pass it through a Javascript function like escape (if your pages are served as ISO-8859-1) or encodeURIComponent (if you’re serving UTF-8). Both of these functions percent-encode the extended characters so that PHP can translate them back.
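To make the outbound trip concrete, here is a small sketch of what PHP sees after decoding each form. The two sample strings are hand-written stand-ins for what escape and encodeURIComponent produce for the single character “ü”, and the variable names are only for illustration. PHP normally does this decoding for you when it fills $_GET and $_POST; rawurldecode is called directly here just to show the bytes.

```php
<?php
// What each Javascript function sends for the single character "ü":
$fromEscape = '%FC';                 // escape(): one ISO-8859-1 byte
$fromEncodeURIComponent = '%C3%BC';  // encodeURIComponent(): two UTF-8 bytes

// rawurldecode() only reverses the percent-encoding. It does not convert
// between character sets, so the two inputs decode to different bytes.
$latin1Bytes = rawurldecode($fromEscape);
$utf8Bytes = rawurldecode($fromEncodeURIComponent);

echo bin2hex($latin1Bytes), "\n"; // fc
echo bin2hex($utf8Bytes), "\n";   // c3bc
```

Either form arrives intact, but the server has to know which character set it received before it does anything else with the text.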

The problem occurs on the return trip; if the PHP script sends back any raw extended characters, Javascript has a tantrum, dumps out a bunch of garbage, and embarrasses itself in front of the whole internet.

The solution: any text that may contain European characters and will be returned to a Javascript routine (for example, via XMLHttpRequest) needs to be passed through the function above to properly encode the extended characters as HTML entities (for example, “&uuml;”).

This function exists because we can’t simply encode the whole text with a call to htmlentities. In comments and LGF articles, the text may contain HTML tags—and we don’t want those to be encoded or they’ll display as text in the browser instead of acting as HTML.
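A quick sketch of that failure mode, assuming the text is ISO-8859-1 (the sample comment string is made up for illustration):

```php
<?php
// A comment fragment with markup plus an accented character.
// "\xFC" is the single ISO-8859-1 byte for "ü".
$comment = "<b>f\xFCr</b>";

// htmlentities() encodes everything it can, including the tag delimiters.
$encoded = htmlentities($comment, ENT_NOQUOTES, 'ISO-8859-1');

echo $encoded, "\n"; // &lt;b&gt;f&uuml;r&lt;/b&gt;
```

The accented character is handled correctly, but the bold tag would now show up in the browser as literal text.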

So the function above gets the two PHP translation tables (for htmlentities and htmlspecialchars), and calls array_diff to generate a translation table that omits the HTML-specific characters: &, <, and >. Single and double quotes are left alone too, because ENT_NOQUOTES keeps them out of both tables in the first place. Then it simply calls strtr (string translate) to replace those pesky foreign characters with their equivalent HTML entities, leaving the HTML tags intact.
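The steps described above can be tried out with the sketch below. It is a variant, not the original function: the name convertLatin1ToHtmlExplicit is hypothetical, and it passes 'ISO-8859-1' as an explicit third argument on the assumption that the input is Latin-1, since newer PHP versions (5.4 and later) build the table for UTF-8 when no encoding is given.

```php
<?php
// Hypothetical variant of the function above with the character set spelled out.
// Assumption: $str is ISO-8859-1 text that may contain HTML tags.
function convertLatin1ToHtmlExplicit($str) {
    $allEntities = get_html_translation_table(HTML_ENTITIES, ENT_NOQUOTES, 'ISO-8859-1');
    $specialEntities = get_html_translation_table(HTML_SPECIALCHARS, ENT_NOQUOTES, 'ISO-8859-1');
    // array_diff() compares values, so the entries for &, < and > drop out.
    $noTags = array_diff($allEntities, $specialEntities);
    return strtr($str, $noTags);
}

// The table that gets subtracted holds just the three markup characters.
$special = get_html_translation_table(HTML_SPECIALCHARS, ENT_NOQUOTES, 'ISO-8859-1');
echo count($special), "\n"; // 3

// "\xFC" is "ü" in ISO-8859-1. The tags and the existing &amp; are untouched.
$result = convertLatin1ToHtmlExplicit("<b>f\xFCr</b> &amp; more");
echo $result, "\n"; // <b>f&uuml;r</b> &amp; more
```

Because the ampersand is one of the omitted characters, entities already present in the text are not encoded a second time.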

And now we have a nice, safe, Javascript-friendly string that can be passed back to any browser and displayed correctly, without fear of embarrassment.

(Note: as usual, there’s a caveat with Internet Explorer—some HTML entities are not supported in IE by default, and may display as little boxes.)
