Tech Note: A PHP Function to Strip Specific HTML Tags and Attributes

Parse those tags, mama
LGF • Views: 46,431

Every once in a while I come up with a bit of code that does something well enough I think it’s worth sharing, especially if it performs a function commonly used by lots of programmers, with a method that doesn’t already have hundreds of results in a Google search. And here’s one of those bits; maybe this will come in handy to another coder out there on the Internets, searching in vain for a good routine to parse tags and attributes.

For a long time I’ve been using a function to “sanitize” user input (e.g. comments and LGF Pages) to ensure no malicious code ever gets posted either on purpose or by accident. But I also allow certain HTML tags, and certain attributes for those tags.

The problem with the function I was using is that it employed regular expressions to parse the HTML, and that’s just a big freaking headache — inflexible and fragile. If you search a site like Stack Overflow for info on parsing HTML with regular expressions you’ll see tons of comments telling you “DO NOT DO THIS.” But the code worked well enough for a long time — until I started needing to allow HTML5 “data” attributes inside some tags.

Data attributes typically look like this:

data-conversation="none"

The word “data” is followed by a hyphen and then a variable name for the type of data involved. And that’s where my old attribute parsing function started to become a real hassle to use, because it was unable to deal with that variable part of the name easily, and I needed to add every type of data attribute I used to a list as they came up.

So today I came up with a vastly improved function that uses PHP’s DOMDocument library to actually build a Document Object Model out of the HTML code, then find the tags and attributes the right way instead of the bad old regex way, and remove everything except the tags and attributes I want to leave in place.

The new approach lets me use a regular expression only to match the names of data attributes, not to find those attributes inside a big pile of HTML code. The task of finding the tags and attributes is handled by an XPath query run through PHP’s DOMXPath class.
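To make that concrete, here’s a minimal sketch of matching attribute names against a list in which one entry is a pattern. The list and the sample names are made up for illustration, and the pattern is anchored with ^ and $ so each entry has to match the whole attribute name; that anchoring is an assumption of this sketch.

```php
<?php
// Hypothetical allow-list: two literal names plus one pattern for data attributes.
$allowed = array('href', 'class', 'data-\S*');

// Join the entries into a single anchored, case-insensitive pattern.
$pattern = '%^(?:' . implode('|', $allowed) . ')$%i';

// Sample attribute names, as they might be read from a parsed tag.
$names = array('href', 'data-conversation', 'data-id', 'onclick', 'style');

$kept = array();
foreach ($names as $name) {
	// The regex only ever sees a single attribute name, never raw HTML.
	if (preg_match($pattern, $name)) {
		$kept[] = $name;
	}
}
print_r($kept);
```

Any new data attribute is covered by the one pattern entry, so nothing has to be added to the list when a new one comes up.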

Without further explanation, here’s that new function. It takes a string of HTML to “sanitize” and two arrays as parameters; the arrays are a list of the allowed tags, and a list of the allowed attributes (each attribute entry is treated as a regular expression). Note that if “href” or “src” attributes are allowed, the function checks to see if the value of the attribute is a “javascript:” URL, and changes it to “#” if so.

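<?php
function stripTagsAttributes($html, $allowedTags = array(), $allowedAttributes = array('(?:a^)')) {
	if (empty($html)) {
		return '';
	}
	$theTags = count($allowedTags) ? '<' . implode('><', $allowedTags) . '>' : '';
	// Anchor the pattern so an entry like "src" can't also match "srcset".
	$theAttributes = '%^(?:' . implode('|', $allowedAttributes) . ')$%i';
	// First pass: drop every tag that isn't on the allowed list.
	$stripped = strip_tags($html, $theTags);
	if (trim($stripped) === '') {
		return '';
	}
	// loadHTML() has to be called on an instance; the static call fails in PHP 8.
	// Note that the 'HTML-ENTITIES' conversion is deprecated as of PHP 8.2.
	$dom = new DOMDocument();
	@$dom->loadHTML(mb_convert_encoding($stripped, 'HTML-ENTITIES', 'UTF-8'));
	$xpath = new DOMXPath($dom);
	foreach ($xpath->query('//*') as $tag) {
		// Collect the names first, so attributes aren't removed from the list being walked.
		$attrs = array();
		for ($i = 0; $i < $tag->attributes->length; $i++) {
			$attrs[] = $tag->attributes->item($i)->name;
		}
		foreach ($attrs as $attribute) {
			if (!preg_match($theAttributes, $attribute)) {
				$tag->removeAttribute($attribute);
			} elseif (preg_match('%^(?:href|src)$%i', $attribute)) {
				// Browsers strip leading whitespace and embedded tabs/newlines from URLs,
				// so remove those before testing the scheme.
				$value = preg_replace('%[\x00-\x20]+%', '', $tag->getAttribute($attribute));
				if (preg_match('%^javascript:%i', $value)) {
					$tag->setAttribute($attribute, '#');
				}
			}
		}
	}
	// Protect the markup-significant entities so html_entity_decode() can't turn
	// escaped text such as "&lt;a onclick=...&gt;" back into a live tag.
	$out = preg_replace('%&(?=(?:amp|lt|gt|quot);)%', '&amp;', $dom->saveHTML());
	// Second strip_tags pass removes the DOCTYPE, <html> and <body> added by DOMDocument.
	return trim(strip_tags(html_entity_decode($out), $theTags));
}
?>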

There are a couple of gotchas with using PHP’s DOMDocument library I should mention:

  1. If you're parsing a fragment of HTML code instead of an entire page, and it doesn't declare a character encoding, DOMDocument will assume the text is encoded in ISO-8859-1 instead of the much preferred UTF-8. So this function uses mb_convert_encoding to convert any non-ASCII characters into HTML entities before loading the code fragment, then uses html_entity_decode to convert the entities back into characters when the parsing is finished.
  2. The second gotcha is that when you're parsing an HTML fragment, DOMDocument by default adds a DOCTYPE and <html> and <body> tags around the fragment. (PHP 5.4 and later accept the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD options in loadHTML to suppress them, but these can rearrange fragments that have more than one top-level element.) So after parsing, my function uses strip_tags a second time to remove those extra unneeded tags.
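Both gotchas are easy to see on a tiny fragment. This is only a sketch: the sample fragment is made up, and it swaps in mb_encode_numericentity for the 'HTML-ENTITIES' conversion (which is deprecated as of PHP 8.2), so it is not the exact code path of the function above.

```php
<?php
// Hypothetical fragment: no <html>, no <body>, no charset declaration.
$fragment = '<p class="note">café</p>';

// Gotcha 1: turn non-ASCII characters into numeric entities before parsing,
// so DOMDocument's ISO-8859-1 assumption can't mangle them.
$encoded = mb_encode_numericentity($fragment, array(0x80, 0x10FFFF, 0, 0x1FFFFF), 'UTF-8');

$dom = new DOMDocument();
@$dom->loadHTML($encoded);
$raw = $dom->saveHTML();

// Gotcha 2: the serialized output now carries a DOCTYPE, <html> and <body>.
$hasWrapper = (stripos($raw, '<body>') !== false);

// Strip the wrapper tags, then decode the entities back into UTF-8 characters.
// Decoding also unescapes &lt; and &gt;, which is harmless only because this
// sample fragment contains no escaped markup.
$clean = trim(html_entity_decode(strip_tags($raw, '<p>'), ENT_QUOTES, 'UTF-8'));

var_dump($hasWrapper);
echo $clean, "\n";
```

Dumping $raw is a quick way to see exactly what DOMDocument wrapped around the fragment.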

Oh yes, and one more thing; this is what the two arrays of tags and attributes look like for our comments; notice that the last item in the $commentAttributes array is a simple regular expression that matches any type of data attribute:

<?php
$commentTags = array(
	'b',
	'i',
	'a',
	'strong',
	'em',
	'blockquote',
	'div',
	'p',
	'br',
	'strike',
	'del',
	'sup',
	'sub',
	'code',
	'pre',
	'span',
	'img',
	'button',
);
$commentAttributes = array(
	'href',
	'rel',
	'target',
	'src',
	'width',
	'height',
	'class',
	'data-\S*'
);
?>

There are more sanitization measures in place when comments are posted, by the way; for example, images can only be embedded in comments if they’re uploaded and hosted at LGF, otherwise they’re transformed into links to the external images. This is to prevent anyone posting malicious images that contain code, porn, etc.
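The image rule could be sketched along these lines. To be clear, this is not the code LGF runs: the function name linkExternalImages, the host images.example.com, and the choice to treat relative URLs as external are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: replace every <img> whose src is not hosted on
// $allowedHost with a plain link to that image.
function linkExternalImages($html, $allowedHost) {
	$dom = new DOMDocument();
	@$dom->loadHTML($html);
	// Copy the node list first, so nodes aren't replaced in the list being walked.
	$images = iterator_to_array($dom->getElementsByTagName('img'));
	foreach ($images as $img) {
		$src = $img->getAttribute('src');
		$host = parse_url($src, PHP_URL_HOST);
		// Relative URLs have no host and count as external in this sketch.
		if (!is_string($host) || strcasecmp($host, $allowedHost) !== 0) {
			$link = $dom->createElement('a');
			$link->setAttribute('href', $src);
			$link->appendChild($dom->createTextNode($src));
			$img->parentNode->replaceChild($link, $img);
		}
	}
	// Remove the DOCTYPE, <html> and <body> that DOMDocument adds.
	return trim(strip_tags($dom->saveHTML(), '<a><img><p>'));
}

$sample = '<p><img src="https://example.org/a.png"><img src="https://images.example.com/b.png"></p>';
$result = linkExternalImages($sample, 'images.example.com');
echo $result, "\n";
```

A step like this would run in addition to the tag and attribute filtering, not in place of it.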

Last updated: 2016-01-01 10:29 am PST