Usage¶
Quick start using pywb; Google Chrome must already be installed:
pip install crocoite pywb
crocoite http://example.com/ example.com.warc.gz
wb-manager init test && wb-manager add test example.com.warc.gz
wayback &
$BROWSER http://localhost:8080
It is recommended to install at least Microsoft’s Corefonts as well as DejaVu, Liberation or a similar font family covering a wide range of character sets. Otherwise page screenshots may be unusable due to missing glyphs.
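On Debian-based systems, for example, suitable fonts can be installed from the distribution’s packages (package names assumed; they may differ on your distribution):
apt install ttf-mscorefonts-installer fonts-dejavu fonts-liberation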
Recursion¶
By default crocoite will only retrieve the URL specified on the command line. However, it can follow links as well. There are currently two recursion strategies available, depth- and prefix-based.
crocoite -r 1 https://example.com/ example.com.warc.gz
will retrieve example.com and all pages directly referred to by it. Increasing the number increases the depth: a value of 2 would first grab example.com, then queue all pages linked there as well as every reference on each of those pages.
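Such a depth-two crawl would look like this:
crocoite -r 2 https://example.com/ example.com.warc.gz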
On the other hand
crocoite -r prefix https://example.com/dir/ example.com.warc.gz
will retrieve the URL specified and all pages referenced which have the same URL prefix. The trailing slash is significant: without it crocoite would also grab /dir-something or /dir.html, for example.
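The difference is easiest to see side by side; the second command would also match /dir-something and /dir.html:
crocoite -r prefix https://example.com/dir/ dir.warc.gz
crocoite -r prefix https://example.com/dir dir.warc.gz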
If an output file template is used, each page is written to an individual file. For example
crocoite -r prefix https://example.com/ '{host}-{date}-{seqnum}.warc.gz'
will write one page per file, named like example.com-2019-09-09T15:15:15+02:00-1.warc.gz. seqnum is unique to each page of a single job and should always be used.
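A job that archives three pages might thus produce files like the following (timestamps invented for illustration):
example.com-2019-09-09T15:15:15+02:00-1.warc.gz
example.com-2019-09-09T15:15:32+02:00-2.warc.gz
example.com-2019-09-09T15:15:49+02:00-3.warc.gz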
When running a recursive job, increasing the concurrency (i.e. how many pages are fetched at the same time) can speed up the process. For example, you can pass -j 4 to retrieve four pages at the same time. Keep in mind that each process starts a full browser that requires a lot of resources (one to two GB of RAM and one or two CPU cores).
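Putting it together, a recursive job fetching four pages in parallel could be started like this:
crocoite -r prefix -j 4 https://example.com/ '{host}-{date}-{seqnum}.warc.gz'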
Customizing¶
Under the hood crocoite starts one instance of crocoite-single to fetch each page. You can customize its options by appending a command template like this:
crocoite -r prefix https://example.com example.com.warc.gz -- \
crocoite-single --timeout 5 -k '{url}' '{dest}'
This reduces the global timeout to 5 seconds and ignores TLS errors. If an option is prefixed with an exclamation mark (!) it will not be expanded. This is useful for passing --warcinfo, which expects JSON-encoded data.
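A sketch of such a template (the JSON payload is invented, and the exact placement of the ! prefix is assumed; it keeps the braces in the JSON from being treated as template fields):
crocoite https://example.com/ example.com.warc.gz -- \
crocoite-single --warcinfo '!{"operator": "jane@example.com"}' '{url}' '{dest}'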
Command line options¶
Below is a list of all command line arguments available:
crocoite¶
Front-end with recursion support and simple job management.
-j N, --concurrency N¶
Maximum number of concurrent fetch jobs.
-r POLICY, --recursion POLICY¶
Enables recursion based on POLICY, which can be a positive integer (recursion depth) or the string prefix.
--tempdir DIR¶
Directory for temporary WARC files.
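For example, combining these options (directory path assumed):
crocoite -j 2 -r 1 --tempdir /var/tmp https://example.com/ example.com.warc.gz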
crocoite-single¶
Back-end to fetch a single page.
-b SET-COOKIE, --cookie SET-COOKIE¶
Add a cookie to the browser’s cookie jar. This option always appends cookies, replacing those provided by -c.
New in version 1.1.
-c FILE, --cookie-jar FILE¶
Load cookies from FILE. crocoite provides a default cookie file, which contains cookies to, for example, circumvent age restrictions. This option replaces that default file.
New in version 1.1.
--idle-timeout SEC¶
Time after which a page is considered “idle”.
-k, --insecure¶
Allow insecure connections, i.e. self-signed or expired HTTPS certificates.
--timeout SEC¶
Global archiving timeout.
--warcinfo JSON¶
Inject additional JSON-encoded information into the resulting WARC.
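crocoite-single can also be invoked directly; for example (timeout values chosen arbitrarily):
crocoite-single --timeout 60 --idle-timeout 10 -k https://example.com/ example.com.warc.gz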
IRC bot¶
A simple IRC bot (“chromebot”) is provided with the command crocoite-irc. It reads its configuration from a config file like the example provided in contrib/chromebot.json and supports the following commands:
- a <url> -j <concurrency> -r <policy> -k -b <set-cookie>
Archive <url> with <concurrency> processes according to recursion <policy>.
- s <uuid>
Get job status for <uuid>.
- r <uuid>
Revoke or abort running job with <uuid>.
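For example, to archive a site with prefix recursion and two fetch processes, one would issue the following in the bot’s channel:
a https://example.com/ -r prefix -j 2
The job’s <uuid> can then be passed to s to check progress or to r to abort it.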