Tech Note: Character Encoding Bug Hunt

Charles Johnson
Fri May 30, 2008 at 11:14 am PDT

An open thread for a Friday morning. I’m chasing down a long-running, very annoying character translation bug that causes Western European characters (with accènts, ümlauts, etc.) to show up as garbage when the Ajax system transfers them back and forth from the server.

I’m pretty sure I’ve finally killed the bug, but we can test my solution to destruction in this thread.

UPDATE at 5/30/08 12:11:33 pm:

Our plan for world domination is coming together, and the character encoding part of it now works very well. After trying a million or more different approaches, and only getting halfway there, the real solution involved a PHP function containing only 5 lines of code:

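function convertLatin1ToHtml($str) {
    // Full table: every character that htmlentities() would translate
    $allEntities = get_html_translation_table( HTML_ENTITIES, ENT_NOQUOTES );
    // Markup-only table: the characters htmlspecialchars() would translate
    $specialEntities = get_html_translation_table( HTML_SPECIALCHARS, ENT_NOQUOTES );
    // Drop the markup characters so that HTML tags pass through untouched
    $noTags = array_diff($allEntities, $specialEntities);
    // Replace the remaining extended characters with their entities
    $str = strtr($str, $noTags);
    return $str;
}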

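The source of the encoding problem is the way Javascript handles raw (unencoded) European characters in the text it receives. XMLHttpRequest decodes a response as UTF-8 unless the response headers declare another charset, so raw ISO-8859-1 bytes come out mangled.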

There’s no problem when Javascript reads the text field and sends the text to the server. Before sending the text you’ll typically use a Javascript function like escape (if your pages are served as ISO-8859-1) or encodeURIComponent (if you’re serving UTF-8); both of these functions percent-encode the extended characters so that PHP can translate them back.
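To make the outbound trip concrete, here is a rough sketch of what PHP sees for the same word under each encoder. The two sample strings are what escape and encodeURIComponent produce for "Zürich"; the variable names are made up for illustration, and in practice PHP usually does this decoding for you when it fills $_GET and $_POST.

```php
<?php
// What the two Javascript encoders would send for the word "Zürich":
$fromEscape    = 'Z%FCrich';     // escape(): one ISO-8859-1 byte for the umlaut
$fromComponent = 'Z%C3%BCrich';  // encodeURIComponent(): two UTF-8 bytes

// rawurldecode() turns each %XX sequence back into a raw byte
$latin1 = rawurldecode($fromEscape);
$utf8   = rawurldecode($fromComponent);

// Dump the bytes as hex so the difference is visible in any terminal
echo bin2hex($latin1), "\n"; // 5afc72696368
echo bin2hex($utf8), "\n";   // 5ac3bc72696368
```

Either way the bytes arrive intact; the only thing that differs is which charset they are in, which is why the choice of encoder should match how the pages are served.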

The problem occurs on the return trip; if the PHP script sends back any raw extended characters, Javascript has a tantrum, dumps out a bunch of garbage, and embarrasses itself in front of the whole internet.

The solution: any text that may contain European characters and will be returned to a Javascript routine (for example, via XMLHttpRequest) needs to be passed through the function above to properly encode the extended characters as HTML entities (for example, “&uuml;”).
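As a usage sketch, this is roughly what the last step of an Ajax response might look like. The function name latin1ToEntitiesSketch and the sample comment are hypothetical. It also passes 'ISO-8859-1' explicitly as a third argument, which the original function does not do; that is an assumption for newer PHP versions (5.4 and later), where the translation tables default to UTF-8.

```php
<?php
// Same approach as convertLatin1ToHtml(), with the charset spelled out
// (assumption: PHP 5.4+, where the tables default to UTF-8 otherwise).
function latin1ToEntitiesSketch(string $str): string {
    $all     = get_html_translation_table(HTML_ENTITIES, ENT_NOQUOTES, 'ISO-8859-1');
    $special = get_html_translation_table(HTML_SPECIALCHARS, ENT_NOQUOTES, 'ISO-8859-1');
    // Keep only the entries whose entity is not one of the markup entities
    return strtr($str, array_diff($all, $special));
}

// Hypothetical comment text in ISO-8859-1:
// "\xFC" is the byte for u-umlaut, "\xE8" the byte for e-grave
$comment = "<b>Z\xFCrich</b> & Gen\xE8ve";

// Encode just before the text goes back to the browser
$safe = latin1ToEntitiesSketch($comment);
echo $safe, "\n"; // <b>Z&uuml;rich</b> & Gen&egrave;ve
```

The tags and the bare ampersand come through untouched, and everything left in the string is plain ASCII.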

This function exists because we can’t simply encode the whole text with a call to htmlentities. In comments and LGF articles, the text may contain HTML tags—and we don’t want those to be encoded or they’ll display as text in the browser instead of acting as HTML.

So the function above gets the two PHP translation tables (for htmlentities and htmlspecialchars), and calls array_diff to generate a translation table that omits all of the HTML-specific characters: <, >, and &. Single and double quotes are already left out of both tables by ENT_NOQUOTES. Then it simply calls strtr (string translate) to replace those pesky foreign characters with their equivalent HTML entities, leaving the HTML tags alone.
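If you want to see what array_diff actually leaves behind, a quick inspection along these lines should do it. This is only a sketch: the explicit 'ISO-8859-1' argument is my assumption for PHP 5.4 and later, where the tables default to UTF-8.

```php
<?php
$all     = get_html_translation_table(HTML_ENTITIES, ENT_NOQUOTES, 'ISO-8859-1');
$special = get_html_translation_table(HTML_SPECIALCHARS, ENT_NOQUOTES, 'ISO-8859-1');

// array_diff() compares values (the entity strings) and preserves keys,
// so every entry whose entity also appears in $special is removed
$noTags = array_diff($all, $special);

// With ENT_NOQUOTES the markup table holds only &, < and >
print_r(array_keys($special));

var_dump(isset($noTags['<']));    // bool(false): tags are left alone
var_dump(isset($noTags['&']));    // bool(false): so are ampersands
var_dump($noTags["\xFC"]);        // string(6) "&uuml;"
```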

And now we have a nice, safe, Javascript-friendly string that can be passed back to any browser and displayed correctly, without fear of embarrassment.

(Note: as usual, there’s a caveat with Internet Explorer—some HTML entities are not supported in IE by default, and may display as little boxes.)
