Archive websites using headless Google Chrome and its DevTools protocol.
The following dependencies must be present to run crocoite:
It is recommended to prepare a virtualenv and let pip handle the dependency resolution for Python packages instead:
    cd crocoite
    virtualenv -p python3 sandbox
    source sandbox/bin/activate
    pip install .
One-shot commandline interface and pywb playback:
    crocoite-grab --output example.com.warc.gz http://example.com/
    rm -rf collections && wb-manager init test && wb-manager add test example.com.warc.gz
    wayback &
    $BROWSER http://localhost:8080
The naive approach of fetching an HTML page, parsing it and extracting links to referenced resources therefore is not sufficient to create a faithful snapshot of these web applications. A full browser, capable of running scripts and providing modern Web APIs, is absolutely required for this task. Thankfully Google Chrome runs without a display (headless mode) and can be controlled by external programs, allowing them to navigate and extract or inject data. This section describes the solutions crocoite offers and explains design decisions taken.
crocoite captures resources by listening to Chrome’s network events and requesting the response body using Network.getResponseBody. This approach has caveats:

- The original HTTP requests and responses, as sent over the wire, are not available; they are reconstructed from parsed data.
- The character encoding of text documents is changed to UTF-8.
- The content body of HTTP redirects cannot be retrieved due to a race condition.
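The DevTools protocol is JSON-RPC exchanged with the browser over a WebSocket. The sketch below only builds and dispatches such messages to illustrate the flow described above; the method and event names (Network.enable, Network.responseReceived, Network.getResponseBody) are real protocol identifiers, but the dispatch loop is a simplified illustration, not crocoite’s actual implementation:

```python
import json
from itertools import count

class CDPSession:
    """Minimal JSON-RPC framing for the Chrome DevTools protocol.

    Real code would send these messages over Chrome's WebSocket
    endpoint (ws://localhost:9222/devtools/page/...); here we only
    construct and dispatch them to show the message shapes involved.
    """

    def __init__(self):
        self._ids = count(1)

    def command(self, method, **params):
        # Each command carries a unique id so the matching response
        # can be identified later.
        return json.dumps({'id': next(self._ids),
                           'method': method, 'params': params})

    def handle(self, message):
        # Events (unlike command responses) have a 'method' key.
        msg = json.loads(message)
        if msg.get('method') == 'Network.responseReceived':
            # A response arrived; ask Chrome for its body, keyed by
            # the requestId taken from the event.
            rid = msg['params']['requestId']
            return self.command('Network.getResponseBody', requestId=rid)
        return None

session = CDPSession()
# Network events must be enabled before navigating.
enable = session.command('Network.enable')
# A simulated event, shaped like Chrome would emit it:
event = json.dumps({'method': 'Network.responseReceived',
                    'params': {'requestId': '1000.1',
                               'response': {'url': 'http://example.com/'}}})
followup = session.handle(event)
```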
But at the same time it allows crocoite to rely on Chrome’s well-tested network stack and HTTP parser. Thus it supports HTTP versions 1 and 2 as well as encrypted transports like TLS and QUIC. Relying on Chrome also eliminates the need for a man-in-the-middle proxy, like warcprox, which has to decrypt TLS traffic and present a fake certificate to the browser in order to store the transmitted content.
WARC records generated by crocoite are therefore an abstract view of the resource they represent and not necessarily the data sent over the wire. A URL fetched with HTTP/2, for example, will still result in an HTTP/1.1 request/response pair in the WARC file. This may be undesirable from an archivist’s point of view (“save the data exactly like we received it”). But this level of abstraction is inevitable when dealing with more than one protocol.
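To make the abstraction concrete: whatever protocol version Chrome actually used, the archive stores a serialized HTTP/1.1 message rebuilt from the parsed status line and headers. The helper below is a hypothetical illustration of that reconstruction step, not code from crocoite:

```python
def serialize_response(status, status_text, headers, body):
    """Rebuild an HTTP/1.1 response message from parsed data, as it
    would end up inside a WARC response record. The wire format Chrome
    actually used (HTTP/2 frames, QUIC packets) is lost at this point.
    """
    lines = ['HTTP/1.1 {} {}'.format(status, status_text)]
    lines += ['{}: {}'.format(name, value) for name, value in headers]
    head = '\r\n'.join(lines) + '\r\n\r\n'
    return head.encode('ascii') + body

# Even if this resource was delivered over HTTP/2, the record reads
# like a plain HTTP/1.1 exchange:
record = serialize_response(200, 'OK',
                            [('Content-Type', 'text/html; charset=utf-8')],
                            b'<html></html>')
```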
Replaying archived WARCs can be quite challenging and might not be possible with current technology (or even at all):
crocoite is built with the Unix philosophy (“do one thing and do it well”) in mind. Thus crocoite-grab can only save a single page. If you want recursion, use crocoite-recursive, which follows hyperlinks according to --policy. It can either recurse to a maximum number of levels or grab all pages with the same prefix as the start URL:
    crocoite-recursive --policy prefix http://www.example.com/dir/ output
will save all pages in /dir/ and below to individual files in the directory output. You can customize the command used to grab individual pages by appending it after output. This makes distributed grabs possible (ssh to a different machine and execute the job there, queue the command with Slurm, …).
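The prefix policy is easy to reason about: a discovered link is followed only if its URL starts with the start URL. A minimal sketch of such a check (an illustration, not crocoite’s actual code) might look like:

```python
from urllib.parse import urldefrag

def prefix_policy(start_url, candidate):
    """Return True if candidate would be crawled under a prefix policy
    rooted at start_url. Fragments are stripped first, since they never
    reach the server and would otherwise defeat deduplication."""
    url, _fragment = urldefrag(candidate)
    return url.startswith(start_url)

start = 'http://www.example.com/dir/'
```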
A simple IRC bot (“chromebot”) is provided with the command crocoite-irc. It reads its configuration from a config file like the example provided in contrib/chromebot.ini and supports the following commands: