
Penflip updates, bug fixes, and ramblings

Nerding out: a funky image regex

An ongoing issue with Penflip is compiling projects for download (e.g. to PDF). The basic idea is simple and most of the time it works flawlessly, but with arbitrary user input there will be curveballs. And I see every single curveball imaginable.

Even with fantastic open source libraries like Pandoc, I still have quite a bit of pre-processing code, for example to download images that are referenced in projects. Before converting to PDF, markdown files are converted to LaTeX, and image URLs are parsed from the LaTeX for downloading. This was my regex:

/(https|http):\/\/.*?(jpeg|jpg|png|gif|bmp)/i

Look for an image URL and stop at the first image extension (the .*? is lazy, so it matches as little as possible). Simple enough. Turns out an image extension can be included twice in a URL. In hindsight, it's obvious that this is possible from a technical standpoint, but I didn't consider it to be a likely scenario. So of course it happened. Here's a URL that broke the regex:

http://upload.wikimedia.org/wikipedia/commons/thumb/d/d6/Pachinko_balls.jpg/800px-Pachinko_balls.jpg

Note the two .jpg extensions. Ugh. My regex returned:

http://upload.wikimedia.org/wikipedia/commons/thumb/d/d6/Pachinko_balls.jpg
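To see the truncation for yourself, here is a minimal sketch that ports the original pattern to Python's re module. The choice of Python is an assumption for illustration (the post doesn't depend on a particular regex engine), and the names OLD_IMAGE_URL, latex, and truncated are made up for this example.

```python
import re

# The original pattern, ported to Python (an assumption for illustration;
# the names below are hypothetical, not Penflip's actual code).
OLD_IMAGE_URL = re.compile(r"(https|http)://.*?(jpeg|jpg|png|gif|bmp)", re.IGNORECASE)

url = (
    "http://upload.wikimedia.org/wikipedia/commons/thumb/d/d6/"
    "Pachinko_balls.jpg/800px-Pachinko_balls.jpg"
)
# Roughly what the URL looks like once the markdown has become LaTeX.
latex = r"\includegraphics{" + url + "}"

# The lazy .*? stops at the first "jpg", so the match ends mid-URL.
truncated = OLD_IMAGE_URL.search(latex).group(0)
print(truncated)
```

The printed value ends at the first Pachinko_balls.jpg, which is a path segment here rather than the end of the image URL.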

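Here's a better regex. A positive lookbehind anchors the match to the opening \includegraphics{ and a positive lookahead requires the closing } right after the extension, so the full URL is captured: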

/(?<=\\includegraphics{)(https|http):\/\/\S*?\.(jpeg|jpg|png|gif|bmp)(?=})/i
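Here is the same idea as a rough Python sketch, again assuming Python's re module purely for illustration. NEW_IMAGE_URL and extract_image_urls are hypothetical names, and the braces are escaped to be explicit about matching them literally.

```python
import re

# The improved pattern, ported to Python (an assumption for illustration).
# Lookbehind: the URL must directly follow "\includegraphics{".
# Lookahead: the image extension must be directly followed by "}".
NEW_IMAGE_URL = re.compile(
    r"(?<=\\includegraphics\{)(https|http)://\S*?\.(jpeg|jpg|png|gif|bmp)(?=\})",
    re.IGNORECASE,
)

def extract_image_urls(latex):
    """Return every image URL found inside an \\includegraphics{...} call."""
    return [m.group(0) for m in NEW_IMAGE_URL.finditer(latex)]

url = (
    "http://upload.wikimedia.org/wikipedia/commons/thumb/d/d6/"
    "Pachinko_balls.jpg/800px-Pachinko_balls.jpg"
)
latex = r"\includegraphics{" + url + "}"

urls = extract_image_urls(latex)
print(urls)

# A URL that merely appears in running text is ignored.
plain = extract_image_urls("See http://example.com/a.png for details")
print(plain)
```

The first .jpg is followed by a slash rather than a closing brace, so the lookahead fails there and the lazy \S*? keeps extending until the final extension. One caveat of the sketch: because the lookbehind expects \includegraphics{ exactly, an \includegraphics call that carries an optional [...] argument would not match.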

Problem solved!