November 9, 2012

Extracting a file extension from a URL

As part of the development for imageGet, I needed to extract a file extension from a supplied URL. Specifically, I needed to pull extensions that could possibly be images (although for future compatibility's sake, I did not want to explicitly list them in the search itself). Determining a file type by looking at its extension is workable, but since a file extension is no guarantee of file type, you do need to follow up with MIME-type checking afterwards.
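As a rough illustration of that follow-up check, here is a minimal sketch that decides whether a Content-Type header value names an image type. The function name isImageContentType is made up for this example and is not part of imageGet, and the sketch assumes the header value has already been obtained from whatever request fetched the file.

```javascript
// Hypothetical helper (not part of imageGet): returns true when a
// Content-Type header value names an image media type.
function isImageContentType(contentType) {
    if (typeof contentType !== 'string') {
        return false;
    }
    // Drop parameters such as "; charset=binary" and normalise case.
    var mediaType = contentType.split(';')[0].trim().toLowerCase();
    // Require the "image/" prefix followed by at least one more character.
    return mediaType.indexOf('image/') === 0 && mediaType.length > 'image/'.length;
}
```

This only inspects the header string; it does not look at the file's bytes, so it is still only as trustworthy as the server sending the header.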

Since I'm on my regex streak, I ended up doing it with regular expressions, because why not?

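var strA = 'http://www.feedseed.com/image.jpg',
    strB = '.com/image.jpg?abcdef',
    strC = 'config.inc.php',
    strD = 'image.jpg#lolol',
    strE = 'feedseed.com/.htaccess',
    // A dot, 3-4 letters (captured), an optional query string or hash, then end of string.
    // Note the range is a-zA-Z; A-z would also let in characters such as [ \ ] ^ _ and the backtick.
    regex = /\.([a-zA-Z]{3,4})(?:[\?#].+)?$/;

console.log(strA.match(regex)); // captured group [1] is 'jpg'
console.log(strB.match(regex)); // captured group [1] is 'jpg'
console.log(strC.match(regex)); // captured group [1] is 'php'
console.log(strD.match(regex)); // captured group [1] is 'jpg'
console.log(strE.match(regex), 'No match because .htaccess is > 4 characters, not a valid image extension');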

... or alternatively, in PHP:

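<?php
    $url = 'http://domain.tld/image.jpg?queryString#HashAsWell';
    // preg_match returns 1 on a match; $matches[1] then holds the captured extension.
    // Fall back to null so a non-matching URL does not trigger an undefined index.
    $ext = preg_match('/\.([a-zA-Z]{3,4})(?:[\?#].+)?$/', $url, $matches) ? $matches[1] : null;
?>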

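Play around with the fiddle here. The expression also accounts for query strings and hashes being present. Because it is anchored to the end of the string, it only retrieves the letters after a dot that sit right before the end (or right before the query string or hash), so it won't match .com in a URL such as http://domain.com/image.jpg. A bare http://domain.com would still yield com, which is one more reason to follow up with a MIME-type check.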

A non-regex solution would involve taking all content after the trailing period and cutting off everything from the first non-alphabetical character onward. Given that regex is reputedly slow (see footnote 1), a regex solution might end up being slower than a non-regex one. I'd love to see competing functions tackle this one!
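For what it's worth, here is one way that non-regex approach could be sketched. The name getExtensionNoRegex is hypothetical, and the 3-4 letter length check is an assumption added to mirror the {3,4} quantifier in the regular expression. It has not been benchmarked against the regex version.

```javascript
// Hypothetical helper: returns the extension as a string, or null if none qualifies.
function getExtensionNoRegex(url) {
    // Take everything after the last period in the string.
    var lastDot = url.lastIndexOf('.');
    if (lastDot === -1) {
        return null;
    }
    var tail = url.slice(lastDot + 1);

    // Keep the leading ASCII letters and stop at the first non-letter.
    var end = 0;
    while (end < tail.length) {
        var code = tail.charCodeAt(end);
        var isUpper = code >= 65 && code <= 90;  // A-Z
        var isLower = code >= 97 && code <= 122; // a-z
        if (!isUpper && !isLower) {
            break;
        }
        end++;
    }
    var ext = tail.slice(0, end);

    // Assumption: mirror the regex's 3-4 letter restriction.
    return (ext.length === 3 || ext.length === 4) ? ext : null;
}

console.log(getExtensionNoRegex('http://www.feedseed.com/image.jpg')); // 'jpg'
console.log(getExtensionNoRegex('image.jpg#lolol'));                   // 'jpg'
console.log(getExtensionNoRegex('feedseed.com/.htaccess'));            // null
```

One caveat of taking "the trailing period" literally: a period inside the query string or hash (say, image.jpg?v=1.2) becomes the last period, so this sketch returns null there. Stripping everything from the first ? or # before searching would be one way around that.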

====

1 ... compared to simple string search. So I've been told; I haven't actually done any benchmarking to see whether that accusation is unfounded or not.