Tech Note: A Regular Expression to Extract the ASIN Product Code From Any Amazon URL
Every once in a while I come up with a bit of code that might be useful to other programmers, and here’s one of those bits — a regular expression that extracts the ASIN product code from almost any Amazon product URL, including URLs from non-US Amazon stores. (It doesn’t handle pages of search results or other index pages, however.)
I use this at LGF to rewrite links to Amazon and add our affiliate ID to the link, so if someone clicks the link we get a small percentage of the purchase price when they buy something.
This code used to be a series of regular expressions that matched each type of URL, but I was looking at it this week and realized it might be possible to condense the whole process down to just one regex. And after extensive testing, I think I may have gotten pretty close to a universal Amazon ASIN extractor. (I searched Google, and couldn’t find anything this good online.)
In this PHP code, I used the x modifier to let me split the regex across multiple lines with comments and indenting. I also used ~ as the regular expression delimiter so the forward slashes don’t need to be escaped. These two steps make the code much more readable, but if you use this regex in a language that doesn’t let you change the delimiter, you’ll need to insert a backslash before each forward slash.
(Notice that I used non-capturing groups for everything except the ASIN product ID.)
$regex = '~
    (?:www\.)?              # optionally starts with www.
    ama?zo?n\.              # also allow shortened amzn.com URLs
    (?:
        com                 # match the supported Amazon domains
        |
        ca
        |
        co\.uk
        |
        co\.jp
        |
        de
        |
        fr
    )
    /
    (?:                     # here comes the stuff before the ASIN
        exec/obidos/ASIN/   # the possible components of a URL
        |
        o/
        |
        gp/product/
        |
        (?:                 # the dp/ format may contain a title
            (?:[^"\'/]*)/   # anything but a slash or quote
        )?                  # optional
        dp/
        |                   # if short format, nothing before ASIN
    )
    ([A-Z0-9]{10})          # capture group $1 contains the ASIN
    (?:                     # everything after the ASIN
        (?:/|\?|\#)         # beginning with /, ? or #
        (?:[^"\'\s]*)       # everything up to quote or white space
    )?                      # optional
~isx';
Here’s a bit of code that puts this regex to work and inserts our affiliate ID into the reconstructed link:
$text = preg_replace($regex, 'www.amazon.com/dp/$1/?tag=littlegreenfo-20', $text);
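If all you need is the ASIN itself rather than a rewritten link, the same pattern should work with preg_match. Here is a minimal sketch, assuming PHP 7.1 or later. The helper name extract_asin() is made up for illustration, and the pattern is repeated in compact form only so the snippet runs on its own.

```php
<?php
// Same pattern as the multi-line version, repeated so this snippet is self-contained.
$regex = '~(?:www\.)?ama?zo?n\.(?:com|ca|co\.uk|co\.jp|de|fr)/'
       . '(?:exec/obidos/ASIN/|o/|gp/product/|(?:(?:[^"\'/]*)/)?dp/|)'
       . '([A-Z0-9]{10})(?:(?:/|\?|\#)(?:[^"\'\s]*))?~isx';

// extract_asin() is a hypothetical helper name, not part of the original code.
function extract_asin(string $url, string $regex): ?string
{
    // preg_match() returns 1 on a match and fills $m; $m[1] is the only capture group.
    if (preg_match($regex, $url, $m) === 1) {
        // The i modifier also accepts lowercase input, so normalize the result.
        return strtoupper($m[1]);
    }
    return null; // no Amazon product link found
}

$asin = extract_asin('http://www.amazon.com/Man-High-Castle/dp/B00RSGFRY8/ref=sr_1_1', $regex);
echo $asin ?? 'no ASIN found', PHP_EOL;
```

Because every other group is non-capturing, the ASIN is always at index 1 of the match array, whichever URL format matched.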
And here’s what this regular expression looks like when condensed down to one gnarly line:
$regex = '~(?:www\.)?ama?zo?n\.(?:com|ca|co\.uk|co\.jp|de|fr)/(?:exec/obidos/ASIN/|o/|gp/product/|(?:(?:[^"\'/]*)/)?dp/|)([A-Z0-9]{10})(?:(?:/|\?|\#)(?:[^"\'\s]*))?~isx';
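To illustrate the point about delimiters, here is a rough sketch that builds the pattern twice: once with ~ and once with /, where every literal slash gets a backslash. The variable names and the sample link are made up for the example. It is still PHP, so it only shows what the escaped form would look like before you carry it over to another language.

```php
<?php
// Pattern body only, no delimiters or flags (same pattern as the multi-line version).
$body = '(?:www\.)?ama?zo?n\.(?:com|ca|co\.uk|co\.jp|de|fr)/'
      . '(?:exec/obidos/ASIN/|o/|gp/product/|(?:(?:[^"\'/]*)/)?dp/|)'
      . '([A-Z0-9]{10})(?:(?:/|\?|\#)(?:[^"\'\s]*))?';

// With ~ as the delimiter, slashes can stay as they are.
$tildeRegex = '~' . $body . '~isx';

// With / as the delimiter, each literal slash in the body needs a backslash.
$slashRegex = '/' . str_replace('/', '\/', $body) . '/isx';

$url = 'http://www.amazon.com/gp/product/B00RSGFRY8?ie=UTF8'; // hypothetical sample link
preg_match($tildeRegex, $url, $a);
preg_match($slashRegex, $url, $b);

// Both forms should capture the same ASIN.
var_dump($a[1] === $b[1]);
```

Echoing $slashRegex is a quick way to see the escaped version in full.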
If you run that preg_replace call on the following URL:
http://www.amazon.com/Man-High-Castle/dp/B00RSGFRY8/ref=sr_1_1?s=instant-video&ie=UTF8&qid=1421879835&sr=1-1&keywords=the%20man%20in%20the%20high%20castle
You end up with this:
http://www.amazon.com/dp/B00RSGFRY8/?tag=littlegreenfo-20
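A quick way to check the other URL shapes is to loop over a few samples and compare the rewritten links. This is only a sketch: apart from the Man in the High Castle link, the sample URLs are hypothetical ones built from the alternatives named in the pattern, not links collected from Amazon.

```php
<?php
// Same pattern as the multi-line version, repeated so this snippet is self-contained.
$regex = '~(?:www\.)?ama?zo?n\.(?:com|ca|co\.uk|co\.jp|de|fr)/'
       . '(?:exec/obidos/ASIN/|o/|gp/product/|(?:(?:[^"\'/]*)/)?dp/|)'
       . '([A-Z0-9]{10})(?:(?:/|\?|\#)(?:[^"\'\s]*))?~isx';

// Hypothetical sample links, one per URL shape the pattern handles.
$samples = [
    'http://www.amazon.com/exec/obidos/ASIN/B00RSGFRY8/ref=nosim',
    'http://www.amazon.co.uk/gp/product/B00RSGFRY8?ie=UTF8',
    'http://www.amazon.de/dp/B00RSGFRY8',
    'http://www.amazon.com/Man-High-Castle/dp/B00RSGFRY8/ref=sr_1_1',
    'http://amzn.com/B00RSGFRY8',
];

$expected = 'http://www.amazon.com/dp/B00RSGFRY8/?tag=littlegreenfo-20';

foreach ($samples as $url) {
    // The match starts at the host name, so the leading http:// is left in place.
    $rewritten = preg_replace($regex, 'www.amazon.com/dp/$1/?tag=littlegreenfo-20', $url);
    printf("%-4s %s\n", $rewritten === $expected ? 'ok' : 'FAIL', $url);
}
```

Note that the replacement string hardcodes www.amazon.com, so links to the non-US stores come out pointing at the US store.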