Tech Note: Character Encoding Bug Hunt

Charles Johnson
Fri May 30, 2008 at 11:14 am PDT

An open thread for a Friday morning; I’m chasing down a long-running, very annoying character translation bug that causes Western European characters (with accènts, ümlauts, etc.) to show up as garbage when the Ajax system transfers them back and forth from the server.

I’m pretty sure I’ve finally killed the bug, but we can test my solution to destruction in this thread.

UPDATE at 5/30/08 12:11:33 pm:

Our plan for world domination is coming together, and the character encoding part of it now works very well. After trying a million or more different approaches, and only getting halfway there, the real solution involved a PHP function containing only 5 lines of code:

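function convertLatin1ToHtml($str) {
    // Name the encoding explicitly (third argument, PHP 5.3.4+): since PHP 5.4
    // these tables default to UTF-8, which would not match Latin-1 input.
    $allEntities = get_html_translation_table(HTML_ENTITIES, ENT_NOQUOTES, 'ISO-8859-1');
    $specialEntities = get_html_translation_table(HTML_SPECIALCHARS, ENT_NOQUOTES, 'ISO-8859-1');
    // Drop the entries for &, < and > so HTML tags pass through untouched.
    $noTags = array_diff($allEntities, $specialEntities);
    // Replace each remaining extended character with its HTML entity.
    $str = strtr($str, $noTags);
    return $str;
}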

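The source of the encoding problem is the way Javascript handles raw (unencoded) European characters in an Ajax response: unless the response declares its character set, XMLHttpRequest decodes the text as UTF-8, and raw ISO-8859-1 bytes are not valid UTF-8.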

There’s no problem when Javascript reads the text field and sends the text to the server. Before sending the text, you’ll typically use a Javascript function like escape (if your pages are served as ISO-8859-1) or encodeURIComponent (if you’re serving UTF-8), and both of these functions encode the extended characters so that PHP can translate them back.
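To make the difference between the two functions concrete, here is a minimal sketch that encodes the same character both ways. The variable names are only for illustration, and escape is a legacy function that still exists in browsers and Node but is deprecated.

```javascript
// "\u00FC" is the character ü, written as an escape sequence so the example
// does not depend on how this file itself is encoded.
const text = "\u00FC";

// escape() encodes code points up to 0xFF as a single %XX byte,
// which matches the ISO-8859-1 byte for the character.
const latin1Encoded = escape(text);

// encodeURIComponent() percent-encodes the UTF-8 bytes of the character.
const utf8Encoded = encodeURIComponent(text);

console.log(latin1Encoded); // %FC
console.log(utf8Encoded);   // %C3%BC
```

On the PHP side, decoding %FC yields one ISO-8859-1 byte, while decoding %C3%BC yields the two-byte UTF-8 sequence, so the function has to match the encoding the page is served in.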

The problem occurs on the return trip; if the PHP script sends back any raw extended characters, Javascript has a tantrum, dumps out a bunch of garbage, and embarrasses itself in front of the whole internet.

The solution: any text that may contain European characters and will be returned to a Javascript routine (for example, via XMLHttpRequest) needs to be passed through the function above to properly encode the extended characters as HTML entities (for example, “&uuml;” for “ü”).
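Here is a rough sketch of why entities survive the return trip when raw characters do not. It rests on an assumption that is not confirmed above: that the server sends ISO-8859-1 bytes and the browser decodes the response as UTF-8. The byte arrays stand in for a response body; nothing here is LGF's actual code.

```javascript
// Assumption: the response bytes are ISO-8859-1, but they get decoded as UTF-8.
const utf8 = new TextDecoder("utf-8");

// "café" in ISO-8859-1: the é is the single byte 0xE9, which is not a
// complete UTF-8 sequence on its own.
const rawLatin1 = new Uint8Array([0x63, 0x61, 0x66, 0xe9]);
const garbled = utf8.decode(rawLatin1);

// "caf&eacute;" is pure ASCII, and ASCII bytes mean the same thing in
// ISO-8859-1 and UTF-8, so the text comes through unchanged.
const entityText = "caf&eacute;";
const entityBytes = new TextEncoder().encode(entityText);
const intact = utf8.decode(entityBytes);

console.log(garbled === "caf\uFFFD"); // true: é became the replacement character
console.log(intact === entityText);   // true
```

Once the entity string is inserted into the page as HTML, the browser renders &eacute; as é, whatever encoding was assumed in transit.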

This function exists because we can’t simply encode the whole text with a call to htmlentities. In comments and LGF articles, the text may contain HTML tags—and we don’t want those to be encoded or they’ll display as text in the browser instead of acting as HTML.

So the function above gets the two PHP translation tables (for htmlentities and htmlspecialchars), and calls array_diff to generate a translation table that omits the HTML-specific characters &, <, and >; the ENT_NOQUOTES flag already keeps single and double quotes out of both tables. Then it simply calls strtr (string translate) to replace those pesky foreign characters with their equivalent HTML entities, leaving the HTML tags themselves alone.
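The same table-difference idea can be sketched in Javascript to show what ends up in the final table. This is an illustration, not a port of the PHP function: the two tables are hypothetical and heavily abbreviated (PHP's real tables hold many more entries), and translate is a made-up helper that only handles single-character keys.

```javascript
// Hypothetical, abbreviated stand-ins for PHP's two translation tables.
const allEntities = {
  "&": "&amp;", "<": "&lt;", ">": "&gt;",
  "\u00E8": "&egrave;", // è
  "\u00FC": "&uuml;",   // ü
};
const specialEntities = { "&": "&amp;", "<": "&lt;", ">": "&gt;" };

// Like array_diff: keep only the entries whose entity is not in specialEntities.
const specialValues = new Set(Object.values(specialEntities));
const noTags = new Map(
  Object.entries(allEntities).filter(([, entity]) => !specialValues.has(entity))
);

// Like strtr with a table: swap each character found in the table and
// leave every other character as it is.
function translate(str, table) {
  return Array.from(str, (ch) => (table.has(ch) ? table.get(ch) : ch)).join("");
}

const result = translate("<b>\u00FCber caf\u00E8</b>", noTags);
console.log(result); // <b>&uuml;ber caf&egrave;</b>
```

The tag delimiters pass through untouched because their entries were removed from the table, while the accented characters are replaced by entities.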

And now we have a nice, safe, Javascript-friendly string that can be passed back to any browser and displayed correctly, without fear of embarrassment.

(Note: as usual, there’s a caveat with Internet Explorer—some HTML entities are not supported in IE by default, and may display as little boxes.)
