Usage

crocoite.cli.single()

One-shot command-line interface, with pywb playback:

pip install pywb
crocoite-grab http://example.com/ example.com.warc.gz
rm -rf collections && wb-manager init test && wb-manager add test example.com.warc.gz
wayback &
$BROWSER http://localhost:8080

Recursion

crocoite.cli.recursive()

crocoite is built with the Unix philosophy (“do one thing and do it well”) in mind, so crocoite-grab saves only a single page. If you want recursion, use crocoite-recursive, which follows hyperlinks according to --policy. It can either recurse up to a maximum number of levels or grab all pages with the same prefix as the start URL:

crocoite-recursive --policy prefix http://www.example.com/dir/ output

will save all pages in /dir/ and below to individual files in the output directory output. You can customize the command used to grab individual pages by appending it after output. This makes distributed grabs possible (ssh to a different machine and run the job there, queue the command with Slurm, …).
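Conceptually, a recursion policy is just a predicate over discovered links. The following is an illustrative sketch only, not crocoite's actual implementation; fetch, extract_links, and policy are placeholder callables:

```python
from collections import deque

def crawl(start_url, fetch, extract_links, policy):
    """Breadth-first crawl of start_url.

    policy(url, depth) decides whether a discovered link is in scope;
    a depth policy compares depth against a limit, a prefix policy
    checks url.startswith(start_url).
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = []
    while queue:
        url, depth = queue.popleft()
        pages.append(fetch(url))
        for link in extract_links(url):
            if link not in seen and policy(link, depth + 1):
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

For example, a prefix policy would be `lambda url, depth: url.startswith("http://www.example.com/dir/")`, while a depth policy would be `lambda url, depth: depth <= 2`.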

IRC bot

crocoite.cli.irc()

A simple IRC bot (“chromebot”) is provided by the command crocoite-irc. It reads its configuration from a file like the example provided in contrib/chromebot.json and supports the following commands:

a <url> -j <concurrency> -r <policy>
Archive <url> with <concurrency> processes according to recursion <policy>
s <uuid>
Get job status for <uuid>
r <uuid>
Revoke or abort running job with <uuid>
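The command grammar above can be sketched with argparse. This is illustrative only; chromebot's actual parser may differ, and the defaults shown are assumptions:

```python
import argparse
import shlex

def make_parser():
    # Three sub-commands mirroring the bot grammar: a, s, r.
    parser = argparse.ArgumentParser(prog="chromebot", add_help=False)
    sub = parser.add_subparsers(dest="command", required=True)

    archive = sub.add_parser("a", add_help=False)
    archive.add_argument("url")
    archive.add_argument("-j", dest="concurrency", type=int, default=1)
    archive.add_argument("-r", dest="policy", default="none")

    status = sub.add_parser("s", add_help=False)
    status.add_argument("uuid")

    revoke = sub.add_parser("r", add_help=False)
    revoke.add_argument("uuid")
    return parser

def parse_bot_command(line):
    # Split the IRC message like a shell command line, then parse it.
    return make_parser().parse_args(shlex.split(line))
```

For example, `parse_bot_command("a http://example.com/ -j 2 -r prefix")` yields a namespace with command "a", concurrency 2, and policy "prefix".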