Tech Note: A PHP Function to Strip Specific HTML Tags and Attributes

Parse those tags, mama
LGF

Every once in a while I come up with a bit of code that does something well enough I think it’s worth sharing, especially if it performs a function commonly used by lots of programmers, with a method that doesn’t already have hundreds of results in a Google search. And here’s one of those bits; maybe this will come in handy to another coder out there on the Internets, searching in vain for a good routine to parse tags and attributes.

For a long time I’ve been using a function to “sanitize” user input (e.g. comments and LGF Pages) to ensure no malicious code ever gets posted either on purpose or by accident. But I also allow certain HTML tags, and certain attributes for those tags.

The problem with the function I was using is that it employed regular expressions to parse the HTML, and that’s just a big freaking headache — inflexible and fragile. If you search a site like Stack Overflow for info on parsing HTML with regular expressions you’ll see tons of comments telling you “DO NOT DO THIS.” But the code worked well enough for a long time — until I started needing to allow HTML5 “data” attributes inside some tags.

Data attributes typically look like this:

data-conversation="none"

The word “data” is followed by a hyphen and then a variable name for the type of data involved. And that’s where my old attribute parsing function started to become a real hassle to use, because it was unable to deal with that variable part of the name easily, and I needed to add every type of data attribute I used to a list as they came up.
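
As a rough illustration of the naming rule described above, a single anchored pattern can accept any data attribute name without listing each one. The pattern and variable names here are assumptions made for this sketch, not taken from the old function.

```php
<?php
// Hypothetical pattern: "data-" followed by at least one letter, digit,
// underscore, or hyphen. The ^ and $ anchors make it match the whole name.
$dataAttribute = '%^data-[\w-]+$%i';

$names = array('data-conversation', 'data-lang', 'data-', 'metadata-x', 'onclick');
$matches = array();
foreach ($names as $name) {
	// preg_match() returns 1 on a match and 0 otherwise.
	$matches[$name] = preg_match($dataAttribute, $name) === 1;
}
```

Only the first two names pass: "data-" has no variable part, and "metadata-x" does not start with "data-".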

So today I came up with a vastly improved function that uses PHP’s DOMDocument library to actually build a Document Object Model out of the HTML code, then find the tags and attributes the right way instead of the bad old regex way, and remove everything except the tags and attributes I want to leave in place.

The new approach uses a regular expression only to match the names of data attributes, not to find those attributes inside a big pile of HTML code. The task of finding the tags and attributes is handled by XPath queries run against the DOMDocument.
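
To make the idea concrete, here is a minimal sketch of walking a parsed fragment with DOMXPath and listing every attribute it finds. The sample markup and the $found array are made up for illustration.

```php
<?php
// Sample input, invented for this sketch.
$html = '<p class="note" onclick="steal()">Hi <a href="/x" data-conversation="none">there</a></p>';

$dom = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$found = array();
// '//*' selects every element, including the <html> and <body>
// wrappers that loadHTML() adds around a fragment.
foreach ($xpath->query('//*') as $element) {
	foreach ($element->attributes as $attribute) {
		$found[] = $element->nodeName . '@' . $attribute->name;
	}
}
```

After this runs, $found lists p@class, p@onclick, a@href and a@data-conversation, in document order. Deciding which of those to keep is then a plain string test on each name.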

Without further explanation, here’s that new function. It takes a string of HTML to “sanitize” and two arrays as parameters; the arrays are a list of the allowed tags, and a list of the allowed attributes. Note that if “href” or “src” attributes are allowed, the function checks to see if the value of the attribute is a “javascript:” URL, and changes it to “#” if so.

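<?php
function stripTagsAttributes($html, $allowedTags = array(), $allowedAttributes = array('(?:a^)')) {
	if (empty($html)) {
		return '';
	}
	// strip_tags() expects the allowed tags as one string, e.g. '<b><i><a>'.
	$theTags = count($allowedTags) ? '<' . implode('><', $allowedTags) . '>' : '';
	// Anchor the pattern so each alternative must match the whole attribute name.
	// The default, '(?:a^)', can never match, so every attribute is removed.
	$theAttributes = '%^(?:' . implode('|', $allowedAttributes) . ')$%i';
	$stripped = strip_tags($html, $theTags);
	if ($stripped === '') {
		// loadHTML() rejects an empty string (a ValueError on PHP 8).
		return '';
	}
	// loadHTML() is an instance method; calling it statically fails on PHP 8.
	$dom = new DOMDocument();
	// The 'HTML-ENTITIES' conversion target is deprecated as of PHP 8.2.
	@$dom->loadHTML(
		mb_convert_encoding(
			$stripped,
			'HTML-ENTITIES',
			'UTF-8'
		)
	);
	$xpath = new DOMXPath($dom);
	$tags = $xpath->query('//*');
	foreach ($tags as $tag) {
		// Copy the names first; removing attributes while walking the live list skips entries.
		$attrs = array();
		for ($i = 0; $i < $tag->attributes->length; $i++) {
			$attrs[] = $tag->attributes->item($i)->name;
		}
		foreach ($attrs as $attribute) {
			if (!preg_match($theAttributes, $attribute)) {
				$tag->removeAttribute($attribute);
			} elseif (preg_match('%^(?:href|src)$%i', $attribute) && preg_match('%^\s*javascript:%i', $tag->getAttribute($attribute))) {
				// Neutralize script URLs, allowing for leading whitespace.
				$tag->setAttribute($attribute, '#');
			}
		}
	}
	// Caution: html_entity_decode() also turns entities the user typed, such as
	// &lt;b&gt;, back into markup, so an allowed tag written that way survives
	// the final strip_tags() with its attributes unchecked.
	return (
		trim(
			strip_tags(
				html_entity_decode(
					$dom->saveHTML()
				),
				$theTags
			)
		)
	);
}
?>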
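
The href/src check in the function looks at the start of the attribute value for “javascript:”. Browsers are more forgiving than that when they read a URL: they drop leading spaces and control characters, and ignore tabs and newlines anywhere in the value. Below is a hedged sketch of a stricter test; the helper name isJavascriptUrl is hypothetical and is not part of the function above.

```php
<?php
// Hypothetical helper: reports whether a URL value would be treated
// as a javascript: URL after browser-style whitespace cleanup.
function isJavascriptUrl($value) {
	// Tabs and newlines are ignored wherever they appear in a URL.
	$cleaned = preg_replace('/[\t\n\r]+/', '', $value);
	// Leading spaces and control characters are dropped as well.
	$cleaned = preg_replace('/^[\x00-\x20]+/', '', $cleaned);
	return preg_match('/^javascript:/i', $cleaned) === 1;
}
```

With this helper, values such as "  JaVaScRiPt:alert(1)" or one with a tab inside the word “javascript” are caught, while an ordinary link that merely contains the text “javascript:” later in its query string is left alone.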

There are a couple of gotchas with using PHP’s DOMDocument library I should mention:

  1. If you're parsing a fragment of HTML code instead of an entire page, and it doesn't have a character encoding tag, DOMDocument will assume the text is encoded in ISO-8859-1 instead of the much preferred UTF-8. So this function uses mb_convert_encoding to convert any non-ASCII characters into HTML entities before loading the code fragment, then uses html_entity_decode to convert the entities back into characters when the parsing is finished.
  2. The second gotcha is that when you're parsing an HTML fragment, DOMDocument adds a DOCTYPE, <html> and <body> tags to the fragment by default. (PHP 5.4 and later accept the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD options in loadHTML to suppress them, but those flags can rearrange a fragment that has more than one top-level element.) So after parsing, my function uses strip_tags a second time to remove those extra unneeded tags.
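
Both gotchas can also be handled another way, sketched below under a few assumptions. Prefixing the fragment with an XML encoding declaration is a commonly used hint that gets libxml to read the input as UTF-8, and serializing only the children of <body> leaves the added wrappers behind. The helper name fragmentInnerHtml is hypothetical, and this is an alternative technique, not what the function above does.

```php
<?php
// Hypothetical helper: parse an HTML fragment and return it re-serialized,
// without the DOCTYPE, <html>, and <body> wrappers.
function fragmentInnerHtml($fragment) {
	$dom = new DOMDocument();
	libxml_use_internal_errors(true);
	// The declaration prefix hints that the fragment is UTF-8.
	$dom->loadHTML('<?xml encoding="UTF-8">' . $fragment);
	libxml_clear_errors();

	$body = $dom->getElementsByTagName('body')->item(0);
	if ($body === null) {
		return '';
	}
	$out = '';
	// Save each child of <body> on its own, so the wrappers never
	// appear in the output.
	foreach ($body->childNodes as $child) {
		$out .= $dom->saveHTML($child);
	}
	return trim($out);
}
```

A design note: because nothing is entity-decoded after serialization, text that a user typed as &lt;b&gt; stays escaped in the output instead of turning back into a tag.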

Oh yes, and one more thing; this is what the two arrays of tags and attributes look like for our comments; notice that the last item in the $commentAttributes array is a simple regular expression that matches any type of data attribute:

<?php
$commentTags = array(
	'b',
	'i',
	'a',
	'strong',
	'em',
	'blockquote',
	'div',
	'p',
	'br',
	'strike',
	'del',
	'sup',
	'sub',
	'code',
	'pre',
	'span',
	'img',
	'button',
);
$commentAttributes = array(
	'href',
	'rel',
	'target',
	'src',
	'width',
	'height',
	'class',
	'data-\S*'
);
?>
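
For illustration, here is how a list like this can be joined into one pattern, and why it matters whether that pattern is anchored. The shortened list and the variable names are assumptions made for this sketch.

```php
<?php
// A shortened copy of the attribute list, for illustration only.
$allowed = array('href', 'src', 'class', 'data-\S*');

// Unanchored: matches if an allowed name appears anywhere in the attribute name.
$loose = '%' . implode('|', $allowed) . '%i';
// Anchored: the whole attribute name must match one of the alternatives.
$strict = '%^(?:' . implode('|', $allowed) . ')$%i';

$names = array('href', 'data-conversation', 'srcdoc', 'onclick');
$kept = array();
foreach ($names as $name) {
	$kept[$name] = array(
		'loose' => preg_match($loose, $name) === 1,
		'strict' => preg_match($strict, $name) === 1,
	);
}
```

Both patterns accept “href” and “data-conversation” and reject “onclick”, but only the unanchored one lets “srcdoc” through, because that name contains “src”.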

There are more sanitization measures in place when comments are posted, by the way; for example, images can only be embedded in comments if they’re uploaded and hosted at LGF, otherwise they’re transformed into links to the external images. This is to prevent anyone posting malicious images that contain code, porn, etc.

Last updated: 2016-01-01 10:29 am PST