Idiomdrottning’s homepage

Comic Snarfer

comic-snarfer --start-page=[URL] --image-path=[XPATH] --next-path=[XPATH] --for-real

So I’m trying to release more of the stuff I write even if it’s somewhat, uh, “bespoke” stuff. (“Bespoke” is the best backhanded compliment!)

This is a snarfer that trawls through a series of web pages. It saves images from them and then finds the link to the next page and recurses from there. It rips web comics, pretty much. It could also snarf other media (including just normal html pages) because it uses xpath to dispatch, not file endings.

This program is notoriously difficult and finicky to use and relies on you having a good understanding of html and writing xpath expressions.

That’s why it assumes you’re making an implicit “dry run” until you supply the argument --for-real. My advice is to hold off on that until the output for the first page looks right to you.

It’ll download directly to your current working directory, so make sure you are in a good clean empty place that you can fill with images.

There are three required options.

--start-page: Just a plain URL to whatever page you want to start at.
--image-path: An xpath pointing to the main snarfable content of the page. If this matches multiple things (multiple images for example), they will all be saved.
--next-path: An xpath pointing to the next page. If this matches multiple things, the snarfer follows the first one. If it doesn’t match anything, the snarfer terminates.
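
To make the three options a bit more concrete, here is a minimal Python sketch of the loop they describe. It is not the snarfer’s actual Scheme source, just an illustration under a few assumptions: the names snarf, fetch and PAGES are made up, the stdlib ElementTree only understands a small subset of xpath and needs well-formed markup, and reading src and href off the matched elements is a guess at what the xpaths point at. Nothing is downloaded; the sketch only collects the URLs it would save.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def snarf(start_page, image_path, next_path, fetch, for_real=False):
    """Walk pages from start_page and return the image URLs found.

    fetch(url) is a hypothetical callable returning well-formed XHTML text.
    """
    found = []
    url = start_page
    while url is not None:
        root = ET.fromstring(fetch(url))
        # Every match of the image path is kept, not only the first.
        for node in root.findall(image_path):
            found.append(urljoin(url, node.get("src")))
        # A dry run stops after the first page and follows no links.
        if not for_real:
            break
        # Only the first match of the next path is followed;
        # no match at all ends the walk.
        nxt = root.find(next_path)
        url = urljoin(url, nxt.get("href")) if nxt is not None else None
    return found


# Made-up two-page site standing in for real HTTP requests.
PAGES = {
    "http://example.com/1": (
        "<html><body><div id='comic'>"
        "<img src='/a/comic.jpg'/><img src='b.png'/></div>"
        "<a rel='next' href='/2'>next</a><a rel='next' href='/9'>other</a>"
        "</body></html>"
    ),
    "http://example.com/2": (
        "<html><body><div id='comic'><img src='/c/comic.jpg'/></div>"
        "</body></html>"
    ),
}

print(snarf("http://example.com/1", ".//div[@id='comic']/img",
            ".//a[@rel='next']", PAGES.__getitem__, for_real=True))
```

The first page has two images and two “next” links: both images are collected, only the first link is followed, and the walk ends on the second page because nothing matches the next path there.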

There are two non-required options.

--start-issue: The files are renamed to include their domain and their paths, because some webcomic sites just call it “comic.jpg” and depend on the paths to disambiguate. The snarfer also prefixes them with a number; this number is internal to the snarfer and just increments for everything it saves. It starts at the --start-issue number, which defaults to zero, and that is what you want most of the time. The point of this option is in case the snarfer crashed or was terminated and you want to resume with the same numbering.
--for-real: The snarfer assumes a dry-run unless you supply this. I.e. without this flag, it only shows you what it would have saved from the first page, and it doesn’t follow any links. That’s good, because the xpaths can be finicky to get right.
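
Here is a small Python sketch (again, not the snarfer’s Scheme code) of the renaming and numbering idea. The exact file name format the snarfer produces isn’t spelled out here, so the zero padding and the dash separators are assumptions, and so are the names local_name and number_files.

```python
from urllib.parse import urlsplit


def local_name(number, url):
    """Build a file name from the counter, the domain and the path."""
    parts = urlsplit(url)
    # Assumed format: zero-padded counter, then host and path joined by dashes.
    return f"{number:04d}-{parts.netloc}{parts.path.replace('/', '-')}"


def number_files(urls, start_issue=0):
    """Name every URL in order, counting up from start_issue."""
    return [local_name(n, url) for n, url in enumerate(urls, start=start_issue)]


# Two images that are both called comic.jpg on the server.
URLS = [
    "http://example.com/2020/01/comic.jpg",
    "http://example.com/2020/02/comic.jpg",
]

print(number_files(URLS))
# Resuming after the first file was already saved.
print(number_files(URLS[1:], start_issue=1))
```

Both images are called comic.jpg on the server but end up with distinct local names, and resuming at 1 gives the second file the same name it would have had in an uninterrupted run.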


The code is available at git clone

It’s FOSS (GPL3 or CC-BY-SA-4.0), and here’s how to install it on Debian and derivatives.

apt install chicken-bin
chicken-install anaphora getopt-long http-client html-parser mathh sxpath uri-common
csc -O5 comic-snarfer.scm

When you are hacking, remove the -O5 for clearer error messages.