40 files changed, 3143 insertions, 1195 deletions
diff --git a/.travis.yml b/.travis.yml
index c687962..b1d417c 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -1,11 +1,17 @@
 dist: xenial
 language: python
-python:
-    - "3.6"
-    - "3.6-dev"
-    - "3.7"
-    - "3.7-dev"
-    - "3.8-dev"
+matrix:
+  include:
+    - python: "3.6"
+    - python: "3.7"
+    - python: "3.8"
+    - python: "3.6-dev"
+    - python: "3.7-dev"
+    - python: "3.8-dev"
+  allow_failures:
+    - python: "3.6-dev"
+    - python: "3.7-dev"
+    - python: "3.8-dev"
 install:
     - pip install .
 script:
diff --git a/README.rst b/README.rst
index c604d81..b1d9e5b 100644
--- a/README.rst
+++ b/README.rst
@@ -1,211 +1,15 @@
 crocoite
 ========
 
-Preservation for the modern web, powered by `headless Google
-Chrome`_.
-
-.. image:: https://travis-ci.org/PromyLOPh/crocoite.svg?branch=master
-    :target: https://travis-ci.org/PromyLOPh/crocoite
-
-.. image:: https://codecov.io/gh/PromyLOPh/crocoite/branch/master/graph/badge.svg
-  :target: https://codecov.io/gh/PromyLOPh/crocoite
-
-.. _headless Google Chrome: https://developers.google.com/web/updates/2017/04/headless-chrome
-
-Quick start
------------
-
-These dependencies must be present to run crocoite:
-
-- Python ≥3.6
-- PyYAML_
-- aiohttp_
-- websockets_
-- warcio_
-- html5lib_
-- bottom_ (IRC client)
-- `Google Chrome`_
-
-.. _PyYAML: https://pyyaml.org/wiki/PyYAML
-.. _aiohttp: https://aiohttp.readthedocs.io/
-.. _websockets: https://websockets.readthedocs.io/
-.. _warcio: https://github.com/webrecorder/warcio
-.. _html5lib: https://github.com/html5lib/html5lib-python
-.. _bottom: https://github.com/numberoverzero/bottom
-.. _Google Chrome: https://www.google.com/chrome/
-
-The following commands clone the repository from GitHub_, set up a virtual
-environment and install crocoite:
-
-.. _GitHub: https://github.com/PromyLOPh/crocoite
-
 .. code:: bash
 
-    git clone https://github.com/PromyLOPh/crocoite.git
-    cd crocoite
-    virtualenv -p python3 sandbox
-    source sandbox/bin/activate
-    pip install .
-
-One-shot command line interface and pywb_ playback:
-
-.. code:: bash
-
-    pip install pywb
-    crocoite-grab http://example.com/ example.com.warc.gz
-    rm -rf collections && wb-manager init test && wb-manager add test example.com.warc.gz
+    pip install crocoite pywb
+    crocoite http://example.com/ example.com.warc.gz
+    wb-manager init test && wb-manager add test example.com.warc.gz
     wayback &
     $BROWSER http://localhost:8080
 
-.. _pywb: https://github.com/ikreymer/pywb
-
-Rationale
----------
-
-Most modern websites depend heavily on executing code, usually JavaScript, on
-the user’s machine. They also make use of new and emerging Web technologies
-like HTML5, WebSockets, service workers and more. Even worse from the
-preservation point of view, they also require some form of user interaction to
-dynamically load more content (infinite scrolling, dynamic comment loading,
-etc).
-
-The naive approach of fetching a HTML page, parsing it and extracting
-links to referenced resources therefore is not sufficient to create a faithful
-snapshot of these web applications. A full browser, capable of running scripts and
-providing modern Web API’s is absolutely required for this task. Thankfully
-Google Chrome runs without a display (headless mode) and can be controlled by
-external programs, allowing them to navigate and extract or inject data.
-This section describes the solutions crocoite offers and explains design
-decisions taken.
-
-crocoite captures resources by listening to Chrome’s `network events`_ and
-requesting the response body using `Network.getResponseBody`_. This approach
-has caveats: The original HTTP requests and responses, as sent over the wire,
-are not available. They are reconstructed from parsed data. The character
-encoding for text documents is changed to UTF-8. And the content body of HTTP
-redirects cannot be retrieved due to a race condition.
-
-.. _network events: https://chromedevtools.github.io/devtools-protocol/1-3/Network
-.. _Network.getResponseBody: https://chromedevtools.github.io/devtools-protocol/1-3/Network#method-getResponseBody
-
-But at the same time it allows crocoite to rely on Chrome’s well-tested network
-stack and HTTP parser. Thus it supports HTTP version 1 and 2 as well as
-transport protocols like SSL and QUIC. Depending on Chrome also eliminates the
-need for a man-in-the-middle proxy, like warcprox_, which has to decrypt SSL
-traffic and present a fake certificate to the browser in order to store the
-transmitted content.
-
-.. _warcprox: https://github.com/internetarchive/warcprox
-
-WARC records generated by crocoite therefore are an abstract view on the
-resource they represent and not necessarily the data sent over the wire. A URL
-fetched with HTTP/2 for example will still result in a HTTP/1.1
-request/response pair in the WARC file. This may be undesireable from
-an archivist’s point of view (“save the data exactly like we received it”). But
-this level of abstraction is inevitable when dealing with more than one
-protocol.
-
-crocoite also interacts with and therefore alters the grabbed websites. It does
-so by injecting `behavior scripts`_ into the site. Typically these are written
-in JavaScript, because interacting with a page is easier this way. These
-scripts then perform different tasks: Extracting targets from visible
-hyperlinks, clicking buttons or scrolling the website to to load more content,
-as well as taking a static screenshot of ``<canvas>`` elements for the DOM
-snapshot (see below).
-
-.. _behavior scripts: https://github.com/PromyLOPh/crocoite/tree/master/crocoite/data
-
-Replaying archived WARC’s can be quite challenging and might not be possible
-with current technology (or even at all):
-
-- Some sites request assets based on screen resolution, pixel ratio and
-  supported image formats (webp). Replaying those with different parameters
-  won’t work, since assets for those are missing. Example: missguided.com.
-- Some fetch different scripts based on user agent. Example: youtube.com.
-- Requests containing randomly generated JavaScript callback function names
-  won’t work. Example: weather.com.
-- Range requests (Range: bytes=1-100) are captured as-is, making playback
-  difficult
-
-crocoite offers two methods to work around these issues. Firstly it can save a
-DOM snapshot to the WARC file. It contains the entire DOM in HTML format minus
-``<script>`` tags after the site has been fully loaded and thus can be
-displayed without executing scripts.  Obviously JavaScript-based navigation
-does not work any more. Secondly it also saves a screenshot of the full page,
-so even if future browsers cannot render and display the stored HTML a fully
-rendered version of the website can be replayed instead.
-
-Advanced usage
---------------
-
-crocoite is built with the Unix philosophy (“do one thing and do it well”) in
-mind. Thus ``crocoite-grab`` can only save a single page. If you want recursion
-use ``crocoite-recursive``, which follows hyperlinks according to ``--policy``.
-It can either recurse a maximum number of levels or grab all pages with the
-same prefix as the start URL:
-
-.. code:: bash
-
-   crocoite-recursive --policy prefix http://www.example.com/dir/ output
-
-will save all pages in ``/dir/`` and below to individual files in the output
-directory ``output``. You can customize the command used to grab individual
-pages by appending it after ``output``. This way distributed grabs (ssh to a
-different machine and execute the job there, queue the command with Slurm, …)
-are possible.
-
-IRC bot
-^^^^^^^
-
-A simple IRC bot (“chromebot”) is provided with the command ``crocoite-irc``.
-It reads its configuration from a config file like the example provided in
-``contrib/chromebot.ini`` and supports the following commands:
-
-a <url> -j <concurrency> -r <policy>
-    Archive <url> with <concurrency> processes according to recursion <policy>
-s <uuid>
-    Get job status for <uuid>
-r <uuid>
-    Revoke or abort running job with <uuid>
-
-Browser configuration
-^^^^^^^^^^^^^^^^^^^^^
-
-Generally crocoite provides reasonable defaults for Google Chrome via its
-`devtools module`_. When debugging this software it might be necessary to open
-a non-headless instance of the browser by running
-
-.. code:: bash
-
-   google-chrome-stable --remote-debugging-port=9222 --auto-open-devtools-for-tabs
-
-and then passing the option ``--browser=http://localhost:9222`` to
-``crocoite-grab``. This allows human intervention through the browser’s builtin
-console.
-
-Another issue that might arise is related to fonts. Headless servers usually
-don’t have them installed by default and thus rendered screenshots may contain
-replacement characters (□) instead of the actual text. This affects mostly
-non-latin character sets.  It is therefore recommended to install at least
-Micrsoft’s Corefonts_ as well as DejaVu_, Liberation_ or a similar font family
-covering a wide range of character sets.
-
-.. _devtools module: crocoite/devtools.py
-.. _Corefonts: http://corefonts.sourceforge.net/
-.. _DejaVu: https://dejavu-fonts.github.io/
-.. _Liberation: https://pagure.io/liberation-fonts
-
-Related projects
-----------------
-
-brozzler_
-    Uses Google Chrome as well, but intercepts traffic using a proxy. Supports
-    distributed crawling and immediate playback.
-Squidwarc_
-    Communicates with headless Google Chrome and uses the Network API to
-    retrieve requests like crocoite. Supports recursive crawls and page
-    scrolling, but neither custom JavaScript nor distributed crawling.
+See documentation_ for more information.
 
-.. _brozzler: https://github.com/internetarchive/brozzler
-.. _Squidwarc: https://github.com/N0taN3rd/Squidwarc
+.. _documentation: https://6xq.net/crocoite/
 
diff --git a/contrib/chromebot.ini b/contrib/chromebot.ini
deleted file mode 100644
index a302356..0000000
--- a/contrib/chromebot.ini
+++ /dev/null
@@ -1,10 +0,0 @@
-[irc]
-host = irc.example.com
-port = 6667
-ssl = False
-tempdir = /path/to/warc
-destdir = /path/to/tmp
-nick = chromebot
-channel = #testchannel
-process_limit = 1
-
diff --git a/contrib/chromebot.json b/contrib/chromebot.json
new file mode 100644
index 0000000..214b770
--- /dev/null
+++ b/contrib/chromebot.json
@@ -0,0 +1,16 @@
+{
+  "irc": {
+    "host": "irc.example.com",
+    "port": 6667,
+    "ssl": false,
+    "nick": "chromebot",
+    "channels": ["#testchannel"]
+  },
+  "tempdir": "/path/to/tmp",
+  "destdir": "/path/to/warc",
+  "process_limit": 1
+  "blacklist": {
+    "^https?://(.+\\.)?local(host)?/": "Not acceptable"
+  },
+  "need_voice": false
+}
diff --git a/contrib/dashboard.html b/contrib/dashboard.html
index cc09d50..49a15bc 100644
--- a/contrib/dashboard.html
+++ b/contrib/dashboard.html
@@ -4,7 +4,7 @@
 		<meta charset="utf-8">
 		 <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
         <title>chromebot dashboard</title>
-		<!--<script src="https://cdn.jsdelivr.net/npm/vue/dist/vue.js"></script>-->
+		<!--<script src="https://cdn.jsdelivr.net/npm/vue@2/dist/vue.js"></script>-->
 		<script src="https://cdn.jsdelivr.net/npm/vue@2/dist/vue.min.js"></script>
 		<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.7/css/bulma.min.css">
 		<link rel="stylesheet" href="dashboard.css">
@@ -13,8 +13,9 @@
 		<noscript>Please enable JavaScript.</noscript>
 		<section id="app" class="section">
 		<h1 class="title">chromebot dashboard</h1>
+		<bot-status v-bind:jobs="jobs"></bot-status>
 		<div class="jobs">
-			<job-item v-for="j in jobs" v-bind:job="j" v-bind:jobs="jobs" v-bind:ignored="ignored" v-bind:key="j.id"></job-item>
+			<job-item v-for="j in jobs" v-bind:job="j" v-bind:jobs="jobs" v-bind:key="j.id"></job-item>
 		</div>
 		</section>
 		<script src="dashboard.js"></script>
diff --git a/contrib/dashboard.js b/contrib/dashboard.js
index eb34d43..b5520dc 100644
--- a/contrib/dashboard.js
+++ b/contrib/dashboard.js
@@ -1,5 +1,5 @@
 /* configuration */
-let socket = "ws://localhost:6789/",
+let socket = "wss://localhost:6789/",
 	urllogMax = 100;
 
 function formatSize (bytes) {
@@ -35,19 +35,12 @@ class Job {
 }
 
 let jobs = {};
-/* list of ignored job ids, i.e. those the user deleted from the dashboard */
-let ignored = [];
 let ws = new WebSocket(socket);
 ws.onmessage = function (event) {
 	var msg = JSON.parse (event.data);
 	let msgdate = new Date (Date.parse (msg.date));
 	var j = undefined;
-	console.log (msg);
 	if (msg.job) {
-		if (ignored.includes (msg.job)) {
-			console.log ("job ignored", msg.job);
-			return;
-		}
 		j = jobs[msg.job];
 		if (j === undefined) {
 			j = new Job (msg.job, 'unknown', '<unknown>', new Date ());
@@ -79,7 +72,7 @@ ws.onmessage = function (event) {
 		} else if (rmsg.uuid == '5b8498e4-868d-413c-a67e-004516b8452c') {
 			/* recursion status */
 			Object.assign (j.stats, rmsg);
-		} else if (rmsg.uuid == '1680f384-744c-4b8a-815b-7346e632e8db') {
+		} else if (rmsg.uuid == 'd1288fbe-8bae-42c8-af8c-f2fa8b41794f') {
 			/* fetch */
 			j.addUrl (rmsg.url);
 		}
@@ -91,14 +84,8 @@ ws.onerror = function (event) {
 };
 
 Vue.component('job-item', {
-  props: ['job', 'jobs', 'ignored'],
-  template: '<div class="job box" :id="job.id"><ul class="columns"><li class="jid column is-narrow"><a :href="\'#\' + job.id">{{ job.id }}</a></li><li class="url column"><a :href="job.url">{{ job.url }}</a></li><li class="status column is-narrow"><job-status v-bind:job="job"></job-status></li><li class="column is-narrow"><a class="delete" v-on:click="del(job.id)"></a></li></ul><job-stats v-bind:job="job"></job-stats></div>',
-  methods: {
-  	del: function (id) {
-		Vue.delete(this.jobs, id);
-		this.ignored.push (id);
-	}
-	}
+  props: ['job', 'jobs'],
+  template: '<div class="job box" :id="job.id"><ul class="columns"><li class="jid column is-narrow"><a :href="\'#\' + job.id">{{ job.id }}</a></li><li class="url column"><a :href="job.url">{{ job.url }}</a></li><li class="status column is-narrow"><job-status v-bind:job="job"></job-status></li></ul><job-stats v-bind:job="job"></job-stats></div>',
 });
 Vue.component('job-status', {
 	props: ['job'],
@@ -117,6 +104,21 @@ Vue.component('filesize', {
   template: '<span class="filesize">{{ fvalue }}</span>',
   computed: { fvalue: function () { return formatSize (this.value); } }
 });
+Vue.component('bot-status', {
+	props: ['jobs'],
+	template: '<nav class="level"><div class="level-item has-text-centered"><div><p class="heading">Pending</p><p class="title">{{ stats.pending }}</p></div></div><div class="level-item has-text-centered"><div><p class="heading">Running</p><p class="title">{{ stats.running }}</p></div></div><div class="level-item has-text-centered"><div><p class="heading">Finished</p><p class="title">{{ stats.finished+stats.aborted }}</p></div></div><div class="level-item has-text-centered"><div><p class="heading">Transferred</p><p class="title"><filesize v-bind:value="stats.totalBytes"></filesize></p></div></div></nav>',
+	computed: {
+		stats: function () {
+			let s = {pending: 0, running: 0, finished: 0, aborted: 0, totalBytes: 0};
+			for (let k in this.jobs) {
+				let j = this.jobs[k];
+				s[j.status]++;
+				s.totalBytes += j.stats.bytesRcv;
+			}
+			return s;
+		}
+	}
+});
 
 let app = new Vue({
   el: '#app',
diff --git a/crocoite/behavior.py b/crocoite/behavior.py
index eb5478b..1610751 100644
--- a/crocoite/behavior.py
+++ b/crocoite/behavior.py
@@ -35,35 +35,41 @@ instance.
 """
 
 import asyncio, json, os.path
-from urllib.parse import urlsplit
 from base64 import b64decode
 from collections import OrderedDict
 import pkg_resources
 
 from html5lib.serializer import HTMLSerializer
+from yarl import URL
 import yaml
 
-from .util import getFormattedViewportMetrics, removeFragment
+from .util import getFormattedViewportMetrics
 from . import html
 from .html import StripAttributeFilter, StripTagFilter, ChromeTreeWalker
-from .devtools import Crashed
+from .devtools import Crashed, TabException
 
 class Script:
     """ A JavaScript resource """
 
     __slots__ = ('path', 'data')
+    datadir = 'data'
 
     def __init__ (self, path=None, encoding='utf-8'):
         self.path = path
         if path:
-            self.data = pkg_resources.resource_string (__name__, os.path.join ('data', path)).decode (encoding)
+            self.data = pkg_resources.resource_string (__name__, os.path.join (self.datadir, path)).decode (encoding)
 
     def __repr__ (self):
-        return '<Script {}>'.format (self.path)
+        return f'<Script {self.path}>'
 
     def __str__ (self):
         return self.data
 
+    @property
+    def abspath (self):
+        return pkg_resources.resource_filename (__name__,
+                os.path.join (self.datadir, self.path))
+
     @classmethod
     def fromStr (cls, data, path=None):
         s = Script ()
@@ -89,33 +95,23 @@ class Behavior:
         return True
 
     def __repr__ (self):
-        return '<Behavior {}>'.format (self.name)
+        return f'<Behavior {self.name}>'
 
     async def onload (self):
         """ After loading the page started """
         # this is a dirty hack to make this function an async generator
         return
-        yield
+        yield # pragma: no cover
 
     async def onstop (self):
         """ Before page loading is stopped """
         return
-        yield
+        yield # pragma: no cover
 
     async def onfinish (self):
         """ After the site has stopped loading """
         return
-        yield
-
-class HostnameFilter:
-    """ Limit behavior script to hostname """
-
-    hostname = None
-
-    def __contains__ (self, url):
-        url = urlsplit (url)
-        hostname = url.hostname.split ('.')[::-1]
-        return hostname[:2] == self.hostname
+        yield # pragma: no cover
 
 class JsOnload (Behavior):
     """ Execute JavaScript on page load """
@@ -141,6 +137,8 @@ class JsOnload (Behavior):
         # parameter.
         # XXX: is there a better way to do this?
         result = await tab.Runtime.evaluate (expression=str (self.script))
+        self.logger.debug ('behavior onload inject',
+                uuid='a2da9b78-5648-44c5-bfa8-5c7573e13ad3', result=result)
         exception = result.get ('exceptionDetails', None)
         result = result['result']
         assert result['type'] == 'function', result
@@ -148,23 +146,45 @@ class JsOnload (Behavior):
         constructor = result['objectId']
 
         if self.options:
-            yield Script.fromStr (json.dumps (self.options, indent=2), '{}/options'.format (self.script.path))
-        result = await tab.Runtime.callFunctionOn (
-                functionDeclaration='function(options){return new this(options);}',
-                objectId=constructor,
-                arguments=[{'value': self.options}])
-        result = result['result']
-        assert result['type'] == 'object', result
-        assert result.get ('subtype') != 'error', result
-        self.context = result['objectId']
+            yield Script.fromStr (json.dumps (self.options, indent=2), f'{self.script.path}#options')
+
+        try:
+            result = await tab.Runtime.callFunctionOn (
+                    functionDeclaration='function(options){return new this(options);}',
+                    objectId=constructor,
+                    arguments=[{'value': self.options}])
+            self.logger.debug ('behavior onload start',
+                    uuid='6c0605ae-93b3-46b3-b575-ba45790909a7', result=result)
+            result = result['result']
+            assert result['type'] == 'object', result
+            assert result.get ('subtype') != 'error', result
+            self.context = result['objectId']
+        except TabException as e:
+            if e.args[0] == -32000:
+                # the site probably reloaded. ignore this, since we’ll be
+                # re-injected into the new site by the controller.
+                self.logger.error ('jsonload onload failed',
+                        uuid='c151a863-78d1-41f4-a8e6-c022a6c5d252',
+                        exception=e.args)
+            else:
+                raise
 
     async def onstop (self):
         tab = self.loader.tab
-        assert self.context is not None
-        await tab.Runtime.callFunctionOn (functionDeclaration='function(){return this.stop();}', objectId=self.context)
-        await tab.Runtime.releaseObject (objectId=self.context)
+        try:
+            assert self.context is not None
+            await tab.Runtime.callFunctionOn (functionDeclaration='function(){return this.stop();}',
+                    objectId=self.context)
+            await tab.Runtime.releaseObject (objectId=self.context)
+        except TabException as e:
+            # cannot do anything about that. Ignoring should be fine.
+            self.logger.error ('jsonload onstop failed',
+                    uuid='1786726f-c8ec-4f79-8769-30954d4e32f5',
+                    exception=e.args,
+                    objectId=self.context)
+
         return
-        yield
+        yield # pragma: no cover
 
 ### Generic scripts ###
 
@@ -195,18 +215,25 @@ class EmulateScreenMetrics (Behavior):
         l = self.loader
         tab = l.tab
         for s in sizes:
+            self.logger.debug ('device override',
+                    uuid='3d2d8096-1a75-4830-ad79-ae5f6f97071d', **s)
             await tab.Emulation.setDeviceMetricsOverride (**s)
             # give the browser time to re-eval page and start requests
             # XXX: should wait until loader is not busy any more
             await asyncio.sleep (1)
+        self.logger.debug ('clear override',
+                uuid='f9401683-eb3a-4b86-9bb2-c8c5d876fc8d')
         await tab.Emulation.clearDeviceMetricsOverride ()
         return
-        yield
+        yield # pragma: no cover
 
 class DomSnapshotEvent:
     __slots__ = ('url', 'document', 'viewport')
 
     def __init__ (self, url, document, viewport):
+        # XXX: document encoding?
+        assert isinstance (document, bytes)
+
         self.url = url
         self.document = document
         self.viewport = viewport
@@ -235,18 +262,21 @@ class DomSnapshot (Behavior):
 
         viewport = await getFormattedViewportMetrics (tab)
         dom = await tab.DOM.getDocument (depth=-1, pierce=True)
+        self.logger.debug ('dom snapshot document',
+                uuid='0c720784-8bd1-4fdc-a811-84394d753539', dom=dom)
         haveUrls = set ()
         for doc in ChromeTreeWalker (dom['root']).split ():
-            rawUrl = doc['documentURL']
-            if rawUrl in haveUrls:
+            url = URL (doc['documentURL'])
+            if url in haveUrls:
                 # ignore duplicate URLs. they are usually caused by
                 # javascript-injected iframes (advertising) with no(?) src
-                self.logger.warning ('have DOM snapshot for URL {}, ignoring'.format (rawUrl))
-                continue
-            url = urlsplit (rawUrl)
-            if url.scheme in ('http', 'https'):
-                self.logger.debug ('saving DOM snapshot for url {}, base {}'.format (doc['documentURL'], doc['baseURL']))
-                haveUrls.add (rawUrl)
+                self.logger.warning ('dom snapshot duplicate',
+                        uuid='d44de989-98d4-456e-82e7-9d4c49acab5e')
+            elif url.scheme in ('http', 'https'):
+                self.logger.debug ('dom snapshot',
+                        uuid='ece7ff05-ccd9-44b5-b6a8-be25a24b96f4',
+                        base=doc["baseURL"])
+                haveUrls.add (url)
                 walker = ChromeTreeWalker (doc)
                 # remove script, to make the page static and noscript, because at the
                 # time we took the snapshot scripts were enabled
@@ -254,7 +284,7 @@ class DomSnapshot (Behavior):
                 disallowedAttributes = html.eventAttributes
                 stream = StripAttributeFilter (StripTagFilter (walker, disallowedTags), disallowedAttributes)
                 serializer = HTMLSerializer ()
-                yield DomSnapshotEvent (removeFragment (doc['documentURL']), serializer.render (stream, 'utf-8'), viewport)
+                yield DomSnapshotEvent (url.with_fragment(None), serializer.render (stream, 'utf-8'), viewport)
 
 class ScreenshotEvent:
     __slots__ = ('yoff', 'data', 'url')
@@ -267,35 +297,77 @@ class ScreenshotEvent:
 class Screenshot (Behavior):
     """
     Create screenshot from tab and write it to WARC
+
+    Chrome will allocate an additional 512MB of RAM when using this plugin.
     """
 
+    __slots__ = ('script')
+
     name = 'screenshot'
 
+    # Hardcoded max texture size of 16,384 (crbug.com/770769)
+    maxDim = 16*1024
+
+    def __init__ (self, loader, logger):
+        super ().__init__ (loader, logger)
+        self.script = Script ('screenshot.js')
+
     async def onfinish (self):
         tab = self.loader.tab
 
+        # for top-level/full-screen elements with position: fixed we need to
+        # figure out their actual size (i.e. scrollHeight) and use that when
+        # overriding the viewport size.
+        # we could do this without javascript, but that would require several
+        # round-trips to Chrome or pulling down the entire DOM+computed styles
+        tab = self.loader.tab
+        yield self.script
+        result = await tab.Runtime.evaluate (expression=str (self.script), returnByValue=True)
+        assert result['result']['type'] == 'object', result
+        result = result['result']['value']
+
+        # this is required to make the browser render more than just the small
+        # actual viewport (i.e. entire page).  see
+        # https://github.com/GoogleChrome/puppeteer/blob/45873ea737b4ebe4fa7d6f46256b2ea19ce18aa7/lib/Page.js#L805
+        metrics = await tab.Page.getLayoutMetrics ()
+        contentSize = metrics['contentSize']
+        contentHeight = max (result + [contentSize['height']])
+
+        override = {
+                'width': 0,
+                'height': 0,
+                'deviceScaleFactor': 0,
+                'mobile': False,
+                'viewport': {'x': 0,
+                    'y': 0,
+                    'width': contentSize['width'],
+                    'height': contentHeight,
+                    'scale': 1}
+                }
+        self.logger.debug ('screenshot override',
+                uuid='e0affa18-cbb1-4d97-9d13-9a88f704b1b2', override=override)
+        await tab.Emulation.setDeviceMetricsOverride (**override)
+
         tree = await tab.Page.getFrameTree ()
         try:
-            url = removeFragment (tree['frameTree']['frame']['url'])
+            url = URL (tree['frameTree']['frame']['url']).with_fragment (None)
         except KeyError:
-            self.logger.error ('frame without url', tree=tree)
+            self.logger.error ('frame without url',
+                    uuid='edc2743d-b93e-4ba1-964e-db232f2f96ff', tree=tree)
             url = None
 
-        # see https://github.com/GoogleChrome/puppeteer/blob/230be28b067b521f0577206899db01f0ca7fc0d2/examples/screenshots-longpage.js
-        # Hardcoded max texture size of 16,384 (crbug.com/770769)
-        maxDim = 16*1024
-        metrics = await tab.Page.getLayoutMetrics ()
-        contentSize = metrics['contentSize']
-        width = min (contentSize['width'], maxDim)
+        width = min (contentSize['width'], self.maxDim)
         # we’re ignoring horizontal scroll intentionally. Most horizontal
         # layouts use JavaScript scrolling and don’t extend the viewport.
-        for yoff in range (0, contentSize['height'], maxDim):
-            height = min (contentSize['height'] - yoff, maxDim)
+        for yoff in range (0, contentHeight, self.maxDim):
+            height = min (contentHeight - yoff, self.maxDim)
             clip = {'x': 0, 'y': yoff, 'width': width, 'height': height, 'scale': 1}
             ret = await tab.Page.captureScreenshot (format='png', clip=clip)
             data = b64decode (ret['data'])
             yield ScreenshotEvent (url, yoff, data)
 
+        await tab.Emulation.clearDeviceMetricsOverride ()
+
 class Click (JsOnload):
     """ Generic link clicking """
 
@@ -305,7 +377,7 @@ class Click (JsOnload):
     def __init__ (self, loader, logger):
         super ().__init__ (loader, logger)
         with pkg_resources.resource_stream (__name__, os.path.join ('data', 'click.yaml')) as fd:
-            self.options['sites'] = list (yaml.load_all (fd))
+            self.options['sites'] = list (yaml.safe_load_all (fd))
 
 class ExtractLinksEvent:
     __slots__ = ('links', )
@@ -313,6 +385,16 @@ class ExtractLinksEvent:
     def __init__ (self, links):
         self.links = links
 
+    def __repr__ (self):
+        return f'<ExtractLinksEvent {self.links!r}>'
+
+def mapOrIgnore (f, l):
+    for e in l:
+        try:
+            yield f (e)
+        except:
+            pass
+
 class ExtractLinks (Behavior):
     """
     Extract links from a page using JavaScript
@@ -333,7 +415,7 @@ class ExtractLinks (Behavior):
         tab = self.loader.tab
         yield self.script
         result = await tab.Runtime.evaluate (expression=str (self.script), returnByValue=True)
-        yield ExtractLinksEvent (list (set (result['result']['value'])))
+        yield ExtractLinksEvent (list (set (mapOrIgnore (URL, result['result']['value']))))
 
 class Crash (Behavior):
     """ Crash the browser. For testing only. Obviously. """
@@ -346,7 +428,7 @@ class Crash (Behavior):
         except Crashed:
             pass
         return
-        yield
+        yield # pragma: no cover
 
 # available behavior scripts. Order matters, move those modifying the page
 # towards the end of available
diff --git a/crocoite/browser.py b/crocoite/browser.py
index c472746..3518789 100644
--- a/crocoite/browser.py
+++ b/crocoite/browser.py
@@ -23,84 +23,197 @@ Chrome browser interactions.
 """
 
 import asyncio
-from urllib.parse import urlsplit
-from base64 import b64decode
+from base64 import b64decode, b64encode
+from datetime import datetime, timedelta
 from http.server import BaseHTTPRequestHandler
 
+from yarl import URL
+from multidict import CIMultiDict
+
 from .logger import Level
 from .devtools import Browser, TabException
 
-class Item:
-    """
-    Simple wrapper containing Chrome request and response
-    """
+# These two classes’ only purpose is so we can later tell whether a body was
+# base64-encoded or a unicode string
+class Base64Body (bytes):
+    def __new__ (cls, value):
+        return bytes.__new__ (cls, b64decode (value))
+
+    @classmethod
+    def fromBytes (cls, b):
+        """ For testing """
+        return cls (b64encode (b))
+
+class UnicodeBody (bytes):
+    def __new__ (cls, value):
+        if type (value) is not str:
+            raise TypeError ('expecting unicode string')
 
-    __slots__ = ('chromeRequest', 'chromeResponse', 'chromeFinished',
-            'isRedirect', 'failed', 'body', 'requestBody')
+        return bytes.__new__ (cls, value.encode ('utf-8'))
 
-    def __init__ (self):
-        self.chromeRequest = {}
-        self.chromeResponse = {}
-        self.chromeFinished = {}
-        self.isRedirect = False
-        self.failed = False
-        self.body = None
-        self.requestBody = None
+class Request:
+    __slots__ = ('headers', 'body', 'initiator', 'hasPostData', 'method', 'timestamp')
+
+    def __init__ (self, method=None, headers=None, body=None):
+        self.headers = headers
+        self.body = body
+        self.hasPostData = False
+        self.initiator = None
+        # HTTP method
+        self.method = method
+        self.timestamp = None
+
+    def __repr__ (self):
+        return f'Request({self.method!r}, {self.headers!r}, {self.body!r})'
+
+    def __eq__ (self, b):
+        if b is None:
+            return False
+
+        if not isinstance (b, Request):
+            raise TypeError ('Can only compare equality with Request.')
+
+        # do not compare hasPostData (only required to fetch body) and
+        # timestamp (depends on time)
+        return self.headers == b.headers and \
+                self.body == b.body and \
+                self.initiator == b.initiator and \
+                self.method == b.method
+
+class Response:
+    __slots__ = ('status', 'statusText', 'headers', 'body', 'bytesReceived',
+            'timestamp', 'mimeType')
+
+    def __init__ (self, status=None, statusText=None, headers=None, body=None, mimeType=None):
+        self.status = status
+        self.statusText = statusText
+        self.headers = headers
+        self.body = body
+        # bytes received over the network (not body size!)
+        self.bytesReceived = 0
+        self.timestamp = None
+        self.mimeType = mimeType
+
+    def __repr__ (self):
+        return f'Response({self.status!r}, {self.statusText!r}, {self.headers!r}, {self.body!r}, {self.mimeType!r})'
+
+    def __eq__ (self, b):
+        if b is None:
+            return False
+
+        if not isinstance (b, Response):
+            raise TypeError ('Can only compare equality with Response.')
+
+        # do not compare bytesReceived (depends on network), timestamp
+        # (depends on time) and statusText (does not matter)
+        return self.status == b.status and \
+                self.statusText == b.statusText and \
+                self.headers == b.headers and \
+                self.body == b.body and \
+                self.mimeType == b.mimeType
+
+class ReferenceTimestamp:
+    """ Map relative timestamp to absolute timestamp """
+
+    def __init__ (self, relative, absolute):
+        self.relative = timedelta (seconds=relative)
+        self.absolute = datetime.utcfromtimestamp (absolute)
+
+    def __call__ (self, relative):
+        if not isinstance (relative, timedelta):
+            relative = timedelta (seconds=relative)
+        return self.absolute + (relative-self.relative)
+
+class RequestResponsePair:
+    __slots__ = ('request', 'response', 'id', 'url', 'remoteIpAddress',
+            'protocol', 'resourceType', '_time')
+
+    def __init__ (self, id=None, url=None, request=None, response=None):
+        self.request = request
+        self.response = response
+        self.id = id
+        self.url = url
+        self.remoteIpAddress = None
+        self.protocol = None
+        self.resourceType = None
+        self._time = None
 
     def __repr__ (self):
-        return '<Item {}>'.format (self.url)
-
-    @property
-    def request (self):
-        return self.chromeRequest.get ('request', {})
-
-    @property
-    def response (self):
-        return self.chromeResponse.get ('response', {})
-
-    @property
-    def initiator (self):
-        return self.chromeRequest['initiator']
-
-    @property
-    def id (self):
-        return self.chromeRequest['requestId']
-
-    @property
-    def encodedDataLength (self):
-        return self.chromeFinished['encodedDataLength']
-
-    @property
-    def url (self):
-        return self.response.get ('url', self.request.get ('url'))
-
-    @property
-    def parsedUrl (self):
-        return urlsplit (self.url)
-
-    @property
-    def requestHeaders (self):
-        # the response object may contain refined headers, which were
-        # *actually* sent over the wire
-        return self._unfoldHeaders (self.response.get ('requestHeaders', self.request['headers']))
-
-    @property
-    def responseHeaders (self):
-        return self._unfoldHeaders (self.response['headers'])
-
-    @property
-    def statusText (self):
-        text = self.response.get ('statusText')
-        if text:
-            return text
-        text = BaseHTTPRequestHandler.responses.get (self.response['status'])
-        if text:
-            return text[0]
-        return 'No status text available'
-
-    @property
-    def resourceType (self):
-        return self.chromeResponse.get ('type', self.chromeRequest.get ('type', None))
+        return f'RequestResponsePair({self.id!r}, {self.url!r}, {self.request!r}, {self.response!r})'
+
+    def __eq__ (self, b):
+        if not isinstance (b, RequestResponsePair):
+            raise TypeError (f'Can only compare with {self.__class__.__name__}')
+
+        # do not compare id and _time. These depend on external factors and do
+        # not influence the request/response *content*
+        return self.request == b.request and \
+                self.response == b.response and \
+                self.url == b.url and \
+                self.remoteIpAddress == b.remoteIpAddress and \
+                self.protocol == b.protocol and \
+                self.resourceType == b.resourceType
+
+    def fromRequestWillBeSent (self, req):
+        """ Set request data from Chrome Network.requestWillBeSent event """
+        r = req['request']
+
+        self.id = req['requestId']
+        self.url = URL (r['url'])
+        self.resourceType = req.get ('type')
+        self._time = ReferenceTimestamp (req['timestamp'], req['wallTime'])
+
+        assert self.request is None, req
+        self.request = Request ()
+        self.request.initiator = req['initiator']
+        self.request.headers = CIMultiDict (self._unfoldHeaders (r['headers']))
+        self.request.hasPostData = r.get ('hasPostData', False)
+        self.request.method = r['method']
+        self.request.timestamp = self._time (req['timestamp'])
+        if self.request.hasPostData:
+            postData = r.get ('postData')
+            if postData is not None:
+                self.request.body = UnicodeBody (postData)
+
+    def fromResponse (self, r, timestamp=None, resourceType=None):
+        """
+        Set response data from Chrome’s Response object.
+        
+        Request must exist. Updates if response was set before. Sometimes
+        fromResponseReceived is triggered twice by Chrome. No idea why.
+        """
+        assert self.request is not None, (self.request, r)
+
+        if not timestamp:
+            timestamp = self.request.timestamp
+
+        self.remoteIpAddress = r.get ('remoteIPAddress')
+        self.protocol = r.get ('protocol')
+        if resourceType:
+            self.resourceType = resourceType
+
+        # a response may contain updated request headers (i.e. those actually
+        # sent over the wire)
+        if 'requestHeaders' in r:
+            self.request.headers = CIMultiDict (self._unfoldHeaders (r['requestHeaders']))
+
+        self.response = Response ()
+        self.response.headers = CIMultiDict (self._unfoldHeaders (r['headers']))
+        self.response.status = r['status']
+        self.response.statusText = r['statusText']
+        self.response.timestamp = timestamp
+        self.response.mimeType = r['mimeType']
+
+    def fromResponseReceived (self, resp):
+        """ Set response data from Chrome Network.responseReceived """
+        return self.fromResponse (resp['response'],
+                self._time (resp['timestamp']), resp['type'])
+
+    def fromLoadingFinished (self, data):
+        self.response.bytesReceived = data['encodedDataLength']
+
+    def fromLoadingFailed (self, data):
+        self.response = None
 
     @staticmethod
     def _unfoldHeaders (headers):
@@ -114,67 +227,46 @@ class Item:
                 items.append ((k, v))
         return items
 
-    def setRequest (self, req):
-        self.chromeRequest = req
-
-    def setResponse (self, resp):
-        self.chromeResponse = resp
-
-    def setFinished (self, finished):
-        self.chromeFinished = finished
-
     async def prefetchRequestBody (self, tab):
-        # request body
-        req = self.request
-        postData = req.get ('postData')
-        if postData:
-            self.requestBody = postData.encode ('utf8'), False
-        elif req.get ('hasPostData', False):
+        if self.request.hasPostData and self.request.body is None:
             try:
                 postData = await tab.Network.getRequestPostData (requestId=self.id)
-                postData = postData['postData']
-                self.requestBody = b64decode (postData), True
+                self.request.body = UnicodeBody (postData['postData'])
             except TabException:
-                self.requestBody = None
-        else:
-            self.requestBody = None, False
+                self.request.body = None
 
     async def prefetchResponseBody (self, tab):
-        # get response body
+        """ Fetch response body """
         try:
             body = await tab.Network.getResponseBody (requestId=self.id)
-            rawBody = body['body']
-            base64Encoded = body['base64Encoded']
-            if base64Encoded:
-                rawBody = b64decode (rawBody)
+            if body['base64Encoded']:
+                self.response.body = Base64Body (body['body'])
             else:
-                rawBody = rawBody.encode ('utf8')
-            self.body = rawBody, base64Encoded
+                self.response.body = UnicodeBody (body['body'])
         except TabException:
-            self.body = None
+            self.response.body = None
+
+class NavigateError (IOError):
+    pass
 
-class VarChangeEvent:
-    """ Notify when variable is changed """
+class PageIdle:
+    """ Page idle event """
 
-    __slots__ = ('_value', 'event')
+    __slots__ = ('idle', )
 
-    def __init__ (self, value):
-        self._value = value
-        self.event = asyncio.Event()
+    def __init__ (self, idle):
+        self.idle = idle
 
-    def set (self, value):
-        if value != self._value:
-            self._value = value
-            # unblock waiting threads
-            self.event.set ()
-            self.event.clear ()
+    def __bool__ (self):
+        return self.idle
 
-    def get (self):
-        return self._value
+class FrameNavigated:
+    __slots__ = ('id', 'url', 'mimeType')
 
-    async def wait (self):
-        await self.event.wait ()
-        return self._value
+    def __init__ (self, id, url, mimeType):
+        self.id = id
+        self.url = URL (url)
+        self.mimeType = mimeType
 
 class SiteLoader:
     """
@@ -183,18 +275,18 @@ class SiteLoader:
     XXX: track popup windows/new tabs and close them
     """
 
-    __slots__ = ('requests', 'browser', 'url', 'logger', 'tab', '_iterRunning', 'idle', '_framesLoading')
+    __slots__ = ('requests', 'browser', 'logger', 'tab', '_iterRunning',
+            '_framesLoading', '_rootFrame')
     allowedSchemes = {'http', 'https'}
 
-    def __init__ (self, browser, url, logger):
+    def __init__ (self, browser, logger):
         self.requests = {}
         self.browser = Browser (url=browser)
-        self.url = url
-        self.logger = logger.bind (context=type (self).__name__, url=url)
+        self.logger = logger.bind (context=type (self).__name__)
         self._iterRunning = []
 
-        self.idle = VarChangeEvent (True)
         self._framesLoading = set ()
+        self._rootFrame = None
 
     async def __aenter__ (self):
         tab = self.tab = await self.browser.__aenter__ ()
@@ -236,6 +328,7 @@ class SiteLoader:
                 tab.Page.javascriptDialogOpening: self._javascriptDialogOpening,
                 tab.Page.frameStartedLoading: self._frameStartedLoading,
                 tab.Page.frameStoppedLoading: self._frameStoppedLoading,
+                tab.Page.frameNavigated: self._frameNavigated,
                 }
 
         # The implementation is a little advanced. Why? The goal here is to
@@ -247,36 +340,46 @@ class SiteLoader:
         # we need to block (yield) for every item completed, but not
         # handled by the consumer (caller).
         running = self._iterRunning
-        running.append (asyncio.ensure_future (self.tab.get ()))
+        tabGetTask = asyncio.ensure_future (self.tab.get ())
+        running.append (tabGetTask)
         while True:
             done, pending = await asyncio.wait (running, return_when=asyncio.FIRST_COMPLETED)
             for t in done:
                 result = t.result ()
                 if result is None:
                     pass
-                elif isinstance (result, Item):
-                    yield result
-                else:
+                elif t == tabGetTask:
                     method, data = result
                     f = handler.get (method, None)
                     if f is not None:
                         task = asyncio.ensure_future (f (**data))
                         pending.add (task)
-                    pending.add (asyncio.ensure_future (self.tab.get ()))
+                    tabGetTask = asyncio.ensure_future (self.tab.get ())
+                    pending.add (tabGetTask)
+                else:
+                    yield result
 
             running = pending
             self._iterRunning = running
 
-    async def start (self):
-        await self.tab.Page.navigate(url=self.url)
+    async def navigate (self, url):
+        ret = await self.tab.Page.navigate(url=url)
+        self.logger.debug ('navigate',
+                uuid='9d47ded2-951f-4e09-86ee-fd4151e20666', result=ret)
+        if 'errorText' in ret:
+            raise NavigateError (ret['errorText'])
+        self._rootFrame = ret['frameId']
 
     # internal chrome callbacks
     async def _requestWillBeSent (self, **kwargs):
+        self.logger.debug ('requestWillBeSent',
+                uuid='b828d75a-650d-42d2-8c66-14f4547512da', args=kwargs)
+
         reqId = kwargs['requestId']
         req = kwargs['request']
-        logger = self.logger.bind (reqId=reqId, reqUrl=req['url'])
+        url = URL (req['url'])
+        logger = self.logger.bind (reqId=reqId, reqUrl=url)
 
-        url = urlsplit (req['url'])
         if url.scheme not in self.allowedSchemes:
             return
 
@@ -286,38 +389,44 @@ class SiteLoader:
             # redirects never “finish” loading, but yield another requestWillBeSent with this key set
             redirectResp = kwargs.get ('redirectResponse')
             if redirectResp:
-                # create fake responses
-                resp = {'requestId': reqId, 'response': redirectResp, 'timestamp': kwargs['timestamp']}
-                item.setResponse (resp)
-                resp = {'requestId': reqId, 'encodedDataLength': 0, 'timestamp': kwargs['timestamp']}
-                item.setFinished (resp)
-                item.isRedirect = True
-                logger.info ('redirect', uuid='85eaec41-e2a9-49c2-9445-6f19690278b8', target=req['url'])
+                if item.url != url:
+                    # this happens for unknown reasons. the docs simply state
+                    # it can differ in case of a redirect. Fix it and move on.
+                    logger.warning ('redirect url differs',
+                            uuid='558a7df7-2258-4fe4-b16d-22b6019cc163',
+                            expected=item.url)
+                    redirectResp['url'] = str (item.url)
+                item.fromResponse (redirectResp)
+                logger.info ('redirect', uuid='85eaec41-e2a9-49c2-9445-6f19690278b8', target=url)
+                # XXX: queue this? no need to wait for it
                 await item.prefetchRequestBody (self.tab)
-                # cannot fetch request body due to race condition (item id reused)
+                # cannot fetch response body due to race condition (item id reused)
                 ret = item
             else:
                 logger.warning ('request exists', uuid='2c989142-ba00-4791-bb03-c2a14e91a56b')
 
-        item = Item ()
-        item.setRequest (kwargs)
+        item = RequestResponsePair ()
+        item.fromRequestWillBeSent (kwargs)
         self.requests[reqId] = item
-        logger.debug ('request', uuid='55c17564-1bd0-4499-8724-fa7aad65478f')
 
         return ret
 
     async def _responseReceived (self, **kwargs):
+        self.logger.debug ('responseReceived',
+                uuid='ecd67e69-401a-41cb-b4ec-eeb1f1ec6abb', args=kwargs)
+
         reqId = kwargs['requestId']
         item = self.requests.get (reqId)
         if item is None:
             return
 
         resp = kwargs['response']
-        logger = self.logger.bind (reqId=reqId, respUrl=resp['url'])
-        url = urlsplit (resp['url'])
+        url = URL (resp['url'])
+        logger = self.logger.bind (reqId=reqId, respUrl=url)
+        if item.url != url:
+            logger.error ('url mismatch', uuid='7385f45f-0b06-4cbc-81f9-67bcd72ee7d0', respUrl=url)
         if url.scheme in self.allowedSchemes:
-            logger.debug ('response', uuid='84461c4e-e8ef-4cbd-8e8e-e10a901c8bd0')
-            item.setResponse (kwargs)
+            item.fromResponseReceived (kwargs)
         else:
             logger.warning ('scheme forbidden', uuid='2ea6e5d7-dd3b-4881-b9de-156c1751c666')
 
@@ -326,32 +435,37 @@ class SiteLoader:
         Item was fully loaded. For some items the request body is not available
         when responseReceived is fired, thus move everything here.
         """
+        self.logger.debug ('loadingFinished',
+                uuid='35479405-a5b5-4395-8c33-d3601d1796b9', args=kwargs)
+
         reqId = kwargs['requestId']
         item = self.requests.pop (reqId, None)
         if item is None:
             # we never recorded this request (blacklisted scheme, for example)
             return
+        if not item.response:
+            # chrome failed to send us a responseReceived event for this item,
+            # so we can’t record it (missing request/response headers)
+            self.logger.error ('response missing',
+                    uuid='fac3ab96-3f9b-4c5a-95c7-f83b675cdcb9', requestId=item.id)
+            return
+
         req = item.request
-        logger = self.logger.bind (reqId=reqId, reqUrl=req['url'])
-        resp = item.response
-        if req['url'] != resp['url']:
-            logger.error ('url mismatch', uuid='7385f45f-0b06-4cbc-81f9-67bcd72ee7d0', respUrl=resp['url'])
-        url = urlsplit (resp['url'])
-        if url.scheme in self.allowedSchemes:
-            logger.info ('finished', uuid='5a8b4bad-f86a-4fe6-a53e-8da4130d6a02')
-            item.setFinished (kwargs)
+        if item.url.scheme in self.allowedSchemes:
+            item.fromLoadingFinished (kwargs)
+            # XXX queue both
             await asyncio.gather (item.prefetchRequestBody (self.tab), item.prefetchResponseBody (self.tab))
             return item
 
     async def _loadingFailed (self, **kwargs):
+        self.logger.info ('loadingFailed',
+                uuid='4a944e85-5fae-4aa6-9e7c-e578b29392e4', args=kwargs)
+
         reqId = kwargs['requestId']
-        self.logger.warning ('loading failed',
-                uuid='68410f13-6eea-453e-924e-c1af4601748b',
-                errorText=kwargs['errorText'],
-                blockedReason=kwargs.get ('blockedReason'))
+        logger = self.logger.bind (reqId=reqId)
         item = self.requests.pop (reqId, None)
         if item is not None:
-            item.failed = True
+            item.fromLoadingFailed (kwargs)
             return item
 
     async def _entryAdded (self, **kwargs):
@@ -381,11 +495,25 @@ class SiteLoader:
                     uuid='3ef7292e-8595-4e89-b834-0cc6bc40ee38', **kwargs)
 
     async def _frameStartedLoading (self, **kwargs):
+        self.logger.debug ('frameStartedLoading',
+                uuid='bbeb39c0-3304-4221-918e-f26bd443c566', args=kwargs)
+
         self._framesLoading.add (kwargs['frameId'])
-        self.idle.set (False)
+        return PageIdle (False)
 
     async def _frameStoppedLoading (self, **kwargs):
+        self.logger.debug ('frameStoppedLoading',
+                uuid='fcbe8110-511c-4cbb-ac2b-f61a5782c5a0', args=kwargs)
+
         self._framesLoading.remove (kwargs['frameId'])
         if not self._framesLoading:
-            self.idle.set (True)
+            return PageIdle (True)
+
+    async def _frameNavigated (self, **kwargs):
+        self.logger.debug ('frameNavigated',
+                uuid='0e876f7d-7129-4612-8632-686f42ac6e1f', args=kwargs)
+        frame = kwargs['frame']
+        if self._rootFrame == frame['id']:
+            assert frame.get ('parentId', None) is None, "root frame must not have a parent"
+            return FrameNavigated (frame['id'], frame['url'], frame['mimeType'])
 
diff --git a/crocoite/cli.py b/crocoite/cli.py
index c3c41a4..04bbb19 100644
--- a/crocoite/cli.py
+++ b/crocoite/cli.py
@@ -22,27 +22,68 @@
 Command line interface
 """
 
-import argparse, sys, signal, asyncio, os
+import argparse, sys, signal, asyncio, os, json
+from traceback import TracebackException
 from enum import IntEnum
+from yarl import URL
+from http.cookies import SimpleCookie
+import pkg_resources
+try:
+    import manhole
+    manhole.install (patch_fork=False, oneshot_on='USR1')
+except ModuleNotFoundError:
+    pass
 
-from . import behavior
+from . import behavior, browser
 from .controller import SinglePageController, \
         ControllerSettings, StatsHandler, LogHandler, \
         RecursiveController, DepthLimit, PrefixLimit
 from .devtools import Passthrough, Process
 from .warc import WarcHandler
-from .logger import Logger, JsonPrintConsumer, DatetimeConsumer, WarcHandlerConsumer
+from .logger import Logger, JsonPrintConsumer, DatetimeConsumer, \
+        WarcHandlerConsumer, Level
 from .devtools import Crashed
 
+def absurl (s):
+    """ argparse: Absolute URL """
+    u = URL (s)
+    if u.is_absolute ():
+        return u
+    raise argparse.ArgumentTypeError ('Must be absolute')
+
+def cookie (s):
+    """ argparse: Cookie """
+    c = SimpleCookie (s)
+    # for some reason the constructor does not raise an exception if the cookie
+    # supplied is invalid. It’ll simply be empty.
+    if len (c) != 1:
+        raise argparse.ArgumentTypeError ('Invalid cookie')
+    # we want a single Morsel
+    return next (iter (c.values ()))
+
+def cookiejar (f):
+    """ argparse: Cookies from file """
+    cookies = []
+    try:
+        with open (f, 'r') as fd:
+            for l in fd:
+                l = l.lstrip ()
+                if l and not l.startswith ('#'):
+                    cookies.append (cookie (l))
+    except FileNotFoundError:
+        raise argparse.ArgumentTypeError (f'Cookie jar "{f}" does not exist')
+    return cookies
+
 class SingleExitStatus(IntEnum):
     """ Exit status for single-shot command line """
     Ok = 0
     Fail = 1
     BrowserCrash = 2
+    Navigate = 3
 
 def single ():
-    parser = argparse.ArgumentParser(description='Save website to WARC using Google Chrome.')
-    parser.add_argument('--browser', help='DevTools URL', metavar='URL')
+    parser = argparse.ArgumentParser(description='crocoite helper tools to fetch individual pages.')
+    parser.add_argument('--browser', help='DevTools URL', type=absurl, metavar='URL')
     parser.add_argument('--timeout', default=1*60*60, type=int, help='Maximum time for archival', metavar='SEC')
     parser.add_argument('--idle-timeout', default=30, type=int, help='Maximum idle seconds (i.e. no requests)', dest='idleTimeout', metavar='SEC')
     parser.add_argument('--behavior', help='Enable behavior script',
@@ -50,7 +91,19 @@ def single ():
             default=list (behavior.availableMap.keys ()),
             choices=list (behavior.availableMap.keys ()),
             metavar='NAME', nargs='*')
-    parser.add_argument('url', help='Website URL', metavar='URL')
+    parser.add_argument('--warcinfo', help='Add extra information to warcinfo record',
+            metavar='JSON', type=json.loads)
+    # re-using curl’s short/long switch names whenever possible
+    parser.add_argument('-k', '--insecure',
+            action='store_true',
+            help='Disable certificate validation')
+    parser.add_argument ('-b', '--cookie', type=cookie, metavar='SET-COOKIE',
+            action='append', default=[], help='Cookies in Set-Cookie format.')
+    parser.add_argument ('-c', '--cookie-jar', dest='cookieJar',
+            type=cookiejar, metavar='FILE',
+            default=pkg_resources.resource_filename (__name__, 'data/cookies.txt'),
+            help='Cookie jar file, read-only.')
+    parser.add_argument('url', help='Website URL', type=absurl, metavar='URL')
     parser.add_argument('output', help='WARC filename', metavar='FILE')
 
     args = parser.parse_args ()
@@ -61,13 +114,19 @@ def single ():
     service = Process ()
     if args.browser:
         service = Passthrough (args.browser)
-    settings = ControllerSettings (idleTimeout=args.idleTimeout, timeout=args.timeout)
+    settings = ControllerSettings (
+            idleTimeout=args.idleTimeout,
+            timeout=args.timeout,
+            insecure=args.insecure,
+            cookies=args.cookieJar + args.cookie,
+            )
     with open (args.output, 'wb') as fd, WarcHandler (fd, logger) as warcHandler:
         logger.connect (WarcHandlerConsumer (warcHandler))
         handler = [StatsHandler (), LogHandler (logger), warcHandler]
         b = list (map (lambda x: behavior.availableMap[x], args.enabledBehaviorNames))
         controller = SinglePageController (url=args.url, settings=settings,
-                service=service, handler=handler, behavior=b, logger=logger)
+                service=service, handler=handler, behavior=b, logger=logger,
+                warcinfo=args.warcinfo)
         try:
             loop = asyncio.get_event_loop()
             run = asyncio.ensure_future (controller.run ())
@@ -79,9 +138,20 @@ def single ():
             ret = SingleExitStatus.Ok
         except Crashed:
             ret = SingleExitStatus.BrowserCrash
+        except asyncio.CancelledError:
+            # don’t log this one
+            pass
+        except browser.NavigateError:
+            ret = SingleExitStatus.Navigate
+        except Exception as e:
+            ret = SingleExitStatus.Fail
+            logger.error ('cli exception',
+                    uuid='7fd69858-ecaa-4225-b213-8ab880aa3cc5',
+                    traceback=list (TracebackException.from_exception (e).format ()))
         finally:
             r = handler[0].stats
             logger.info ('stats', context='cli', uuid='24d92d16-770e-4088-b769-4020e127a7ff', **r)
+            logger.info ('exit', context='cli', uuid='9b1bd603-f7cd-4745-895a-5b894a5166f2', status=ret)
 
     return ret
 
@@ -92,68 +162,84 @@ def parsePolicy (recursive, url):
         return DepthLimit (int (recursive))
     elif recursive == 'prefix':
         return PrefixLimit (url)
-    raise ValueError ('Unsupported')
+    raise argparse.ArgumentTypeError ('Unsupported recursion mode')
 
 def recursive ():
     logger = Logger (consumer=[DatetimeConsumer (), JsonPrintConsumer ()])
 
-    parser = argparse.ArgumentParser(description='Recursively run crocoite-grab.')
-    parser.add_argument('--policy', help='Recursion policy', metavar='POLICY')
-    parser.add_argument('--tempdir', help='Directory for temporary files', metavar='DIR')
-    parser.add_argument('--prefix', help='Output filename prefix, supports templates {host} and {date}', metavar='FILENAME', default='{host}-{date}-')
-    parser.add_argument('--concurrency', '-j', help='Run at most N jobs', metavar='N', default=1, type=int)
-    parser.add_argument('url', help='Seed URL', metavar='URL')
-    parser.add_argument('output', help='Output directory', metavar='DIR')
-    parser.add_argument('command', help='Fetch command, supports templates {url} and {dest}', metavar='CMD', nargs='*', default=['crocoite-grab', '{url}', '{dest}'])
+    parser = argparse.ArgumentParser(description='Save website to WARC using Google Chrome.')
+    parser.add_argument('-j', '--concurrency',
+            help='Run at most N jobs concurrently', metavar='N', default=1,
+            type=int)
+    parser.add_argument('-r', '--recursion', help='Recursion policy',
+            metavar='POLICY')
+    parser.add_argument('--tempdir', help='Directory for temporary files',
+            metavar='DIR')
+    parser.add_argument('url', help='Seed URL', type=absurl, metavar='URL')
+    parser.add_argument('output',
+            help='Output file, supports templates {host}, {date} and {seqnum}',
+            metavar='FILE')
+    parser.add_argument('command',
+            help='Fetch command, supports templates {url} and {dest}',
+            metavar='CMD', nargs='*',
+            default=['crocoite-single', '{url}', '{dest}'])
 
     args = parser.parse_args ()
     try:
-        policy = parsePolicy (args.policy, args.url)
-    except ValueError:
-        parser.error ('Invalid argument for --policy')
-
-    os.makedirs (args.output, exist_ok=True)
+        policy = parsePolicy (args.recursion, args.url)
+    except argparse.ArgumentTypeError as e:
+        parser.error (str (e))
 
-    controller = RecursiveController (url=args.url, output=args.output,
-            command=args.command, logger=logger, policy=policy,
-            tempdir=args.tempdir, prefix=args.prefix,
-            concurrency=args.concurrency)
+    try:
+        controller = RecursiveController (url=args.url, output=args.output,
+                command=args.command, logger=logger, policy=policy,
+                tempdir=args.tempdir, concurrency=args.concurrency)
+    except ValueError as e:
+        parser.error (str (e))
 
+    run = asyncio.ensure_future (controller.run ())
     loop = asyncio.get_event_loop()
-    stop = lambda signum: controller.cancel ()
+    stop = lambda signum: run.cancel ()
     loop.add_signal_handler (signal.SIGINT, stop, signal.SIGINT)
     loop.add_signal_handler (signal.SIGTERM, stop, signal.SIGTERM)
-    loop.run_until_complete(controller.run ())
-    loop.close()
+    try:
+        loop.run_until_complete(run)
+    except asyncio.CancelledError:
+        pass
+    finally:
+        loop.close()
 
     return 0
 
 def irc ():
-    from configparser import ConfigParser
+    import json, re
     from .irc import Chromebot
 
     logger = Logger (consumer=[DatetimeConsumer (), JsonPrintConsumer ()])
 
     parser = argparse.ArgumentParser(description='IRC bot.')
-    parser.add_argument('--config', '-c', help='Config file location', metavar='PATH', default='chromebot.ini')
+    parser.add_argument('--config', '-c', help='Config file location', metavar='PATH', default='chromebot.json')
 
     args = parser.parse_args ()
 
-    config = ConfigParser ()
-    config.read (args.config)
+    with open (args.config) as fd:
+        config = json.load (fd)
     s = config['irc']
+    blacklist = dict (map (lambda x: (re.compile (x[0], re.I), x[1]), config['blacklist'].items ()))
 
     loop = asyncio.get_event_loop()
     bot = Chromebot (
-            host=s.get ('host'),
-            port=s.getint ('port'),
-            ssl=s.getboolean ('ssl'),
-            nick=s.get ('nick'),
-            channels=[s.get ('channel')],
-            tempdir=s.get ('tempdir'),
-            destdir=s.get ('destdir'),
-            processLimit=s.getint ('process_limit'),
+            host=s['host'],
+            port=s['port'],
+            ssl=s['ssl'],
+            nick=s['nick'],
+            channels=s['channels'],
+            tempdir=config['tempdir'],
+            destdir=config['destdir'],
+            processLimit=config['process_limit'],
             logger=logger,
+            blacklist=blacklist,
+            needVoice=config['need_voice'],
             loop=loop)
     stop = lambda signum: bot.cancel ()
     loop.add_signal_handler (signal.SIGINT, stop, signal.SIGINT)
diff --git a/crocoite/controller.py b/crocoite/controller.py
index f8b1420..8374b4e 100644
--- a/crocoite/controller.py
+++ b/crocoite/controller.py
@@ -22,58 +22,56 @@
 Controller classes, handling actions required for archival
 """
 
-import time
-import tempfile, asyncio, json, os
+import time, tempfile, asyncio, json, os, shutil, signal
 from itertools import islice
 from datetime import datetime
-from urllib.parse import urlparse
 from operator import attrgetter
+from abc import ABC, abstractmethod
+from yarl import URL
 
 from . import behavior as cbehavior
-from .browser import SiteLoader, Item
-from .util import getFormattedViewportMetrics, getSoftwareInfo, removeFragment
+from .browser import SiteLoader, RequestResponsePair, PageIdle, FrameNavigated
+from .util import getFormattedViewportMetrics, getSoftwareInfo
 from .behavior import ExtractLinksEvent
+from .devtools import toCookieParam
 
 class ControllerSettings:
-    __slots__ = ('idleTimeout', 'timeout')
+    __slots__ = ('idleTimeout', 'timeout', 'insecure', 'cookies')
 
-    def __init__ (self, idleTimeout=2, timeout=10):
+    def __init__ (self, idleTimeout=2, timeout=10, insecure=False, cookies=None):
         self.idleTimeout = idleTimeout
         self.timeout = timeout
+        self.insecure = insecure
+        self.cookies = cookies or []
 
-    def toDict (self):
-        return dict (idleTimeout=self.idleTimeout, timeout=self.timeout)
+    def __repr__ (self):
+        return f'<ControllerSetting idleTimeout={self.idleTimeout!r}, timeout={self.timeout!r}, insecure={self.insecure!r}, cookies={self.cookies!r}>'
 
 defaultSettings = ControllerSettings ()
 
-class EventHandler:
+class EventHandler (ABC):
     """ Abstract base class for event handler """
 
     __slots__ = ()
 
-    # this handler wants to know about exceptions before they are reraised by
-    # the controller
-    acceptException = False
-
-    def push (self, item):
+    @abstractmethod
+    async def push (self, item):
         raise NotImplementedError ()
 
 class StatsHandler (EventHandler):
     __slots__ = ('stats', )
 
-    acceptException = True
-
     def __init__ (self):
         self.stats = {'requests': 0, 'finished': 0, 'failed': 0, 'bytesRcv': 0}
 
-    def push (self, item):
-        if isinstance (item, Item):
+    async def push (self, item):
+        if isinstance (item, RequestResponsePair):
             self.stats['requests'] += 1
-            if item.failed:
+            if not item.response:
                 self.stats['failed'] += 1
             else:
                 self.stats['finished'] += 1
-                self.stats['bytesRcv'] += item.encodedDataLength
+                self.stats['bytesRcv'] += item.response.bytesReceived
 
 class LogHandler (EventHandler):
     """ Handle items by logging information about them """
@@ -83,7 +81,7 @@ class LogHandler (EventHandler):
     def __init__ (self, logger):
         self.logger = logger.bind (context=type (self).__name__)
 
-    def push (self, item):
+    async def push (self, item):
         if isinstance (item, ExtractLinksEvent):
             # limit number of links per message, so json blob won’t get too big
             it = iter (item.links)
@@ -102,6 +100,71 @@ class ControllerStart:
     def __init__ (self, payload):
         self.payload = payload
 
+class IdleStateTracker (EventHandler):
+    """ Track SiteLoader’s idle state by listening to PageIdle events """
+
+    __slots__ = ('_idle', '_loop', '_idleSince')
+
+    def __init__ (self, loop):
+        self._idle = True
+        self._loop = loop
+
+        self._idleSince = self._loop.time ()
+
+    async def push (self, item):
+        if isinstance (item, PageIdle):
+            self._idle = bool (item)
+            if self._idle:
+                self._idleSince = self._loop.time ()
+
+    async def wait (self, timeout):
+        """ Wait until page has been idle for at least timeout seconds. If the
+        page has been idle before calling this function it may return
+        immediately. """
+
+        assert timeout > 0
+        while True:
+            if self._idle:
+                now = self._loop.time ()
+                sleep = timeout-(now-self._idleSince)
+                if sleep <= 0:
+                    break
+            else:
+                # not idle, check again after timeout expires
+                sleep = timeout
+            await asyncio.sleep (sleep)
+
+class InjectBehaviorOnload (EventHandler):
+    """ Control behavior script injection based on frame navigation messages.
+    When a page is reloaded (for whatever reason), the scripts need to be
+    reinjected. """
+
+    __slots__ = ('controller', '_loaded')
+
+    def __init__ (self, controller):
+        self.controller = controller
+        self._loaded = False
+
+    async def push (self, item):
+        if isinstance (item, FrameNavigated):
+            await self._runon ('load')
+            self._loaded = True
+
+    async def stop (self):
+        if self._loaded:
+            await self._runon ('stop')
+
+    async def finish (self):
+        if self._loaded:
+            await self._runon ('finish')
+
+    async def _runon (self, method):
+        controller = self.controller
+        for b in controller._enabledBehavior:
+            f = getattr (b, 'on' + method)
+            async for item in f ():
+                await controller.processItem (item)
+
 class SinglePageController:
     """
     Archive a single page url.
@@ -110,120 +173,141 @@ class SinglePageController:
     (stats, warc writer).
     """
 
-    __slots__ = ('url', 'service', 'behavior', 'settings', 'logger', 'handler')
+    __slots__ = ('url', 'service', 'behavior', 'settings', 'logger', 'handler',
+            'warcinfo', '_enabledBehavior')
 
     def __init__ (self, url, logger, \
             service, behavior=cbehavior.available, \
-            settings=defaultSettings, handler=[]):
+            settings=defaultSettings, handler=None, \
+            warcinfo=None):
         self.url = url
         self.service = service
         self.behavior = behavior
         self.settings = settings
         self.logger = logger.bind (context=type (self).__name__, url=url)
-        self.handler = handler
+        self.handler = handler or []
+        self.warcinfo = warcinfo
 
-    def processItem (self, item):
+    async def processItem (self, item):
         for h in self.handler:
-            h.push (item)
+            await h.push (item)
 
     async def run (self):
         logger = self.logger
         async def processQueue ():
             async for item in l:
-                self.processItem (item)
+                await self.processItem (item)
+
+        idle = IdleStateTracker (asyncio.get_event_loop ())
+        self.handler.append (idle)
+        behavior = InjectBehaviorOnload (self)
+        self.handler.append (behavior)
 
-        async with self.service as browser, SiteLoader (browser, self.url, logger=logger) as l:
+        async with self.service as browser, SiteLoader (browser, logger=logger) as l:
             handle = asyncio.ensure_future (processQueue ())
+            timeoutProc = asyncio.ensure_future (asyncio.sleep (self.settings.timeout))
 
-            start = time.time ()
+            # configure browser
+            tab = l.tab
+            await tab.Security.setIgnoreCertificateErrors (ignore=self.settings.insecure)
+            await tab.Network.setCookies (cookies=list (map (toCookieParam, self.settings.cookies)))
 
             # not all behavior scripts are allowed for every URL, filter them
-            enabledBehavior = list (filter (lambda x: self.url in x,
+            self._enabledBehavior = list (filter (lambda x: self.url in x,
                     map (lambda x: x (l, logger), self.behavior)))
 
-            version = await l.tab.Browser.getVersion ()
+            version = await tab.Browser.getVersion ()
             payload = {
                     'software': getSoftwareInfo (),
                     'browser': {
                         'product': version['product'],
                         'useragent': version['userAgent'],
-                        'viewport': await getFormattedViewportMetrics (l.tab),
+                        'viewport': await getFormattedViewportMetrics (tab),
                         },
                     'tool': 'crocoite-single', # not the name of the cli utility
                     'parameters': {
                         'url': self.url,
                         'idleTimeout': self.settings.idleTimeout,
                         'timeout': self.settings.timeout,
-                        'behavior': list (map (attrgetter('name'), enabledBehavior)),
+                        'behavior': list (map (attrgetter('name'), self._enabledBehavior)),
+                        'insecure': self.settings.insecure,
+                        'cookies': list (map (lambda x: x.OutputString(), self.settings.cookies)),
                         },
                     }
-            self.processItem (ControllerStart (payload))
+            if self.warcinfo:
+                payload['extra'] = self.warcinfo
+            await self.processItem (ControllerStart (payload))
 
-            await l.start ()
-            for b in enabledBehavior:
-                async for item in b.onload ():
-                    self.processItem (item)
+            await l.navigate (self.url)
 
-            # wait until the browser has a) been idle for at least
-            # settings.idleTimeout or b) settings.timeout is exceeded
-            timeoutProc = asyncio.ensure_future (asyncio.sleep (self.settings.timeout))
-            idleTimeout = None
+            idleProc = asyncio.ensure_future (idle.wait (self.settings.idleTimeout))
             while True:
-                idleProc = asyncio.ensure_future (l.idle.wait ())
                 try:
                     finished, pending = await asyncio.wait([idleProc, timeoutProc, handle],
-                            return_when=asyncio.FIRST_COMPLETED, timeout=idleTimeout)
+                            return_when=asyncio.FIRST_COMPLETED)
                 except asyncio.CancelledError:
                     idleProc.cancel ()
                     timeoutProc.cancel ()
                     break
 
-                if not finished:
-                    # idle timeout
-                    idleProc.cancel ()
-                    timeoutProc.cancel ()
-                    break
-                elif handle in finished:
+                if handle in finished:
                     # something went wrong while processing the data
+                    logger.error ('fetch failed',
+                        uuid='43a0686a-a3a9-4214-9acd-43f6976f8ff3')
                     idleProc.cancel ()
                     timeoutProc.cancel ()
                     handle.result ()
                     assert False # previous line should always raise Exception
                 elif timeoutProc in finished:
                     # global timeout
+                    logger.debug ('global timeout',
+                            uuid='2f858adc-9448-4ace-94b4-7cd1484c0728')
                     idleProc.cancel ()
                     timeoutProc.result ()
                     break
                 elif idleProc in finished:
-                    # idle state change
-                    isIdle = idleProc.result ()
-                    if isIdle:
-                        # browser is idle, start the clock
-                        idleTimeout = self.settings.idleTimeout
-                    else:
-                        idleTimeout = None
-
-            for b in enabledBehavior:
-                async for item in b.onstop ():
-                    self.processItem (item)
-            await l.tab.Page.stopLoading ()
+                    # idle timeout
+                    logger.debug ('idle timeout',
+                            uuid='90702590-94c4-44ef-9b37-02a16de444c3')
+                    idleProc.result ()
+                    timeoutProc.cancel ()
+                    break
 
+            await behavior.stop ()
+            await tab.Page.stopLoading ()
             await asyncio.sleep (1)
+            await behavior.finish ()
 
-            for b in enabledBehavior:
-                async for item in b.onfinish ():
-                    self.processItem (item)
-
-            # wait until loads from behavior scripts are done
-            await asyncio.sleep (1)
-            if not l.idle.get ():
-                while not await l.idle.wait (): pass
+            # wait until loads from behavior scripts are done and browser is
+            # idle for at least 1 second
+            try:
+                await asyncio.wait_for (idle.wait (1), timeout=1)
+            except (asyncio.TimeoutError, asyncio.CancelledError):
+                pass
 
             if handle.done ():
                 handle.result ()
             else:
                 handle.cancel ()
 
+class SetEntry:
+    """ A object, to be used with sets, that compares equality only on its
+    primary property. """
+    def __init__ (self, value, **props):
+        self.value = value
+        for k, v in props.items ():
+            setattr (self, k, v)
+
+    def __eq__ (self, b):
+        assert isinstance (b, SetEntry)
+        return self.value == b.value
+
+    def __hash__ (self):
+        return hash (self.value)
+
+    def __repr__ (self):
+        return f'<SetEntry {self.value!r}>'
+
 class RecursionPolicy:
     """ Abstract recursion policy """
 
@@ -242,19 +326,17 @@ class DepthLimit (RecursionPolicy):
     __slots__ = ('maxdepth', )
 
     def __init__ (self, maxdepth=0):
-        if maxdepth < 0 or maxdepth > 1:
-            raise ValueError ('Unsupported')
         self.maxdepth = maxdepth
 
     def __call__ (self, urls):
-        if self.maxdepth <= 0:
-            return {}
-        else:
-            self.maxdepth -= 1
-            return urls
+        newurls = set ()
+        for u in urls:
+            if u.depth <= self.maxdepth:
+                newurls.add (u)
+        return newurls
 
     def __repr__ (self):
-        return '<DepthLimit {}>'.format (self.maxdepth)
+        return f'<DepthLimit {self.maxdepth}>'
 
 class PrefixLimit (RecursionPolicy):
     """
@@ -271,7 +353,11 @@ class PrefixLimit (RecursionPolicy):
         self.prefix = prefix
 
     def __call__ (self, urls):
-        return set (filter (lambda u: u.startswith (self.prefix), urls))
+        return set (filter (lambda u: str(u.value).startswith (str (self.prefix)), urls))
+
+def hasTemplate (s):
+    """ Return True if string s has string templates """
+    return '{' in s and '}' in s
 
 class RecursiveController:
     """
@@ -281,47 +367,59 @@ class RecursiveController:
     """
 
     __slots__ = ('url', 'output', 'command', 'logger', 'policy', 'have',
-            'pending', 'stats', 'prefix', 'tempdir', 'running', 'concurrency', '_quit')
+            'pending', 'stats', 'tempdir', 'running', 'concurrency',
+            'copyLock')
 
     SCHEME_WHITELIST = {'http', 'https'}
 
-    def __init__ (self, url, output, command, logger, prefix='{host}-{date}-',
+    def __init__ (self, url, output, command, logger,
             tempdir=None, policy=DepthLimit (0), concurrency=1):
         self.url = url
         self.output = output
         self.command = command
-        self.prefix = prefix
         self.logger = logger.bind (context=type(self).__name__, seedurl=url)
         self.policy = policy
         self.tempdir = tempdir
+        # A lock if only a single output file (no template) is requested
+        self.copyLock = None if hasTemplate (output) else asyncio.Lock ()
+        # some sanity checks. XXX move to argparse?
+        if self.copyLock and os.path.exists (self.output):
+                raise ValueError ('Output file exists')
         # tasks currently running
         self.running = set ()
         # max number of tasks running
         self.concurrency = concurrency
         # keep in sync with StatsHandler
         self.stats = {'requests': 0, 'finished': 0, 'failed': 0, 'bytesRcv': 0, 'crashed': 0, 'ignored': 0}
-        # initiate graceful shutdown
-        self._quit = False
 
-    async def fetch (self, url):
+    async def fetch (self, entry, seqnum):
         """
         Fetch a single URL using an external command
 
-        command is usually crocoite-grab
+        command is usually crocoite-single
         """
 
+        assert isinstance (entry, SetEntry)
+
+        url = entry.value
+        depth = entry.depth
         logger = self.logger.bind (url=url)
 
         def formatCommand (e):
-            return e.format (url=url, dest=dest.name)
+            # provide means to disable variable expansion
+            if e.startswith ('!'):
+                return e[1:]
+            else:
+                return e.format (url=url, dest=dest.name)
 
-        def formatPrefix (p):
-            return p.format (host=urlparse (url).hostname, date=datetime.utcnow ().isoformat ())
+        def formatOutput (p):
+            return p.format (host=url.host,
+                    date=datetime.utcnow ().isoformat (), seqnum=seqnum)
 
         def logStats ():
             logger.info ('stats', uuid='24d92d16-770e-4088-b769-4020e127a7ff', **self.stats)
 
-        if urlparse (url).scheme not in self.SCHEME_WHITELIST:
+        if url.scheme not in self.SCHEME_WHITELIST:
             self.stats['ignored'] += 1
             logStats ()
             self.logger.warning ('scheme not whitelisted', url=url,
@@ -329,69 +427,115 @@ class RecursiveController:
             return
 
         dest = tempfile.NamedTemporaryFile (dir=self.tempdir,
-                prefix=formatPrefix (self.prefix), suffix='.warc.gz',
+                prefix=os.path.basename (self.output) + '-', suffix='.warc.gz',
                 delete=False)
-        destpath = os.path.join (self.output, os.path.basename (dest.name))
         command = list (map (formatCommand, self.command))
-        logger.info ('fetch', uuid='1680f384-744c-4b8a-815b-7346e632e8db', command=command, destfile=destpath)
-        process = await asyncio.create_subprocess_exec (*command, stdout=asyncio.subprocess.PIPE,
-                stderr=asyncio.subprocess.DEVNULL, stdin=asyncio.subprocess.DEVNULL,
-                start_new_session=True)
-        while True:
-            data = await process.stdout.readline ()
-            if not data:
-                break
-            data = json.loads (data)
-            uuid = data.get ('uuid')
-            if uuid == '8ee5e9c9-1130-4c5c-88ff-718508546e0c':
-                links = set (self.policy (map (removeFragment, data.get ('links', []))))
-                links.difference_update (self.have)
-                self.pending.update (links)
-            elif uuid == '24d92d16-770e-4088-b769-4020e127a7ff':
-                for k in self.stats.keys ():
-                    self.stats[k] += data.get (k, 0)
+        logger.info ('fetch', uuid='d1288fbe-8bae-42c8-af8c-f2fa8b41794f',
+                command=command)
+        try:
+            process = await asyncio.create_subprocess_exec (*command,
+                    stdout=asyncio.subprocess.PIPE,
+                    stderr=asyncio.subprocess.DEVNULL,
+                    stdin=asyncio.subprocess.DEVNULL,
+                    start_new_session=True, limit=100*1024*1024)
+            while True:
+                data = await process.stdout.readline ()
+                if not data:
+                    break
+                data = json.loads (data)
+                uuid = data.get ('uuid')
+                if uuid == '8ee5e9c9-1130-4c5c-88ff-718508546e0c':
+                    links = set (self.policy (map (lambda x: SetEntry (URL(x).with_fragment(None), depth=depth+1), data.get ('links', []))))
+                    links.difference_update (self.have)
+                    self.pending.update (links)
+                elif uuid == '24d92d16-770e-4088-b769-4020e127a7ff':
+                    for k in self.stats.keys ():
+                        self.stats[k] += data.get (k, 0)
+                    logStats ()
+        except asyncio.CancelledError:
+            # graceful cancellation
+            process.send_signal (signal.SIGINT)
+        except Exception as e:
+            process.kill ()
+            raise e
+        finally:
+            code = await process.wait()
+            if code == 0:
+                if self.copyLock is None:
+                    # atomically move once finished
+                    lastDestpath = None
+                    while True:
+                        # XXX: must generate a new name every time, otherwise
+                        # this loop never terminates
+                        destpath = formatOutput (self.output)
+                        assert destpath != lastDestpath
+                        lastDestpath = destpath
+
+                        # python does not have rename(…, …, RENAME_NOREPLACE),
+                        # but this is safe nontheless, since we’re
+                        # single-threaded
+                        if not os.path.exists (destpath):
+                            # create the directory, so templates like
+                            # /{host}/{date}/… are possible
+                            os.makedirs (os.path.dirname (destpath), exist_ok=True)
+                            os.rename (dest.name, destpath)
+                            break
+                else:
+                    # atomically (in the context of this process) append to
+                    # existing file
+                    async with self.copyLock:
+                        with open (dest.name, 'rb') as infd, \
+                                open (self.output, 'ab') as outfd:
+                            shutil.copyfileobj (infd, outfd)
+                        os.unlink (dest.name)
+            else:
+                self.stats['crashed'] += 1
                 logStats ()
-        code = await process.wait()
-        if code == 0:
-            # atomically move once finished
-            os.rename (dest.name, destpath)
-        else:
-            self.stats['crashed'] += 1
-            logStats ()
-
-    def cancel (self):
-        """ Gracefully cancel this job, waiting for existing workers to shut down """
-        self.logger.info ('cancel',
-                uuid='d58154c8-ec27-40f2-ab9e-e25c1b21cd88',
-                pending=len (self.pending), have=len (self.have),
-                running=len (self.running))
-        self._quit = True
 
     async def run (self):
         def log ():
+            # self.have includes running jobs
             self.logger.info ('recursing',
                     uuid='5b8498e4-868d-413c-a67e-004516b8452c',
-                    pending=len (self.pending), have=len (self.have),
+                    pending=len (self.pending),
+                    have=len (self.have)-len(self.running),
                     running=len (self.running))
 
-        self.have = set ()
-        self.pending = set ([self.url])
-
-        while self.pending and not self._quit:
-            # since pending is a set this picks a random item, which is fine
-            u = self.pending.pop ()
-            self.have.add (u)
-            t = asyncio.ensure_future (self.fetch (u))
-            self.running.add (t)
-
+        seqnum = 1
+        try:
+            self.have = set ()
+            self.pending = set ([SetEntry (self.url, depth=0)])
+
+            while self.pending:
+                # since pending is a set this picks a random item, which is fine
+                u = self.pending.pop ()
+                self.have.add (u)
+                t = asyncio.ensure_future (self.fetch (u, seqnum))
+                self.running.add (t)
+                seqnum += 1
+
+                log ()
+
+                if len (self.running) >= self.concurrency or not self.pending:
+                    done, pending = await asyncio.wait (self.running,
+                            return_when=asyncio.FIRST_COMPLETED)
+                    self.running.difference_update (done)
+                    # propagate exceptions
+                    for r in done:
+                        r.result ()
+        except asyncio.CancelledError:
+            self.logger.info ('cancel',
+                    uuid='d58154c8-ec27-40f2-ab9e-e25c1b21cd88',
+                    pending=len (self.pending),
+                    have=len (self.have)-len (self.running),
+                    running=len (self.running))
+        finally:
+            done = await asyncio.gather (*self.running,
+                    return_exceptions=True)
+            # propagate exceptions
+            for r in done:
+                if isinstance (r, Exception):
+                    raise r
+            self.running = set ()
             log ()
 
-            if len (self.running) >= self.concurrency or not self.pending:
-                done, pending = await asyncio.wait (self.running,
-                        return_when=asyncio.FIRST_COMPLETED)
-                self.running.difference_update (done)
-
-        done = asyncio.gather (*self.running)
-        self.running = set ()
-        log ()
-
diff --git a/crocoite/data/click.yaml b/crocoite/data/click.yaml
index f88d24d..78278b9 100644
--- a/crocoite/data/click.yaml
+++ b/crocoite/data/click.yaml
@@ -2,91 +2,116 @@
 # Example URLs are random. Believe me.
 match: ^www\.facebook\.com$
 selector:
-  - description: show more comments
-    selector: a.UFIPagerLink[role=button]
+  - description: Show comments and replies/nested comments on user pages.
+    selector: form[action="/ajax/ufi/modify.php"] a[data-testid^="UFI2CommentsPagerRenderer/pager_depth_"]
     urls: ["https://www.facebook.com/tagesschau"]
-  - description: show nested comments
-    selector: a.UFICommentLink[role=button]
-  - description: initially show comments below a single post/video, i.e. /user/post/123
-    selector: form.commentable_item a[data-comment-prelude-ref=action_link_bling][rel=ignore]
+  - description: Initially show comments below a single post/video, i.e. /user/post/123.
+    selector: form[action="/ajax/ufi/modify.php"] a[data-testid="UFI2CommentsCount/root"]
     urls: ["https://www.facebook.com/tagesschau/posts/10157061068659407"]
-  - description: close the “register now” nag screen. for better screen shots
+  - description: Close the “register now” nag screen. For screenshots.
     selector: a#expanding_cta_close_button[role=button]
     urls: ["https://www.facebook.com/tagesschau"]
 ---
 match: ^twitter\.com$
 selector:
-  - description: expand threads
+  - description: Expand threads.
     selector: a.ThreadedConversation-moreRepliesLink
     urls: ["https://twitter.com/realDonaldTrump/status/1068826073775964160"]
-  - description: show hidden profiles
+  - description: Show hidden profiles.
     selector: button.ProfileWarningTimeline-button
     urls: ["https://twitter.com/CookieCyboid"]
-  - description: show hidden/sensitive media. For screen-/snapshots.
+  - description: Show hidden/sensitive media. For screen-/snapshots.
     selector: button.Tombstone-action.js-display-this-media
     urls: ["https://twitter.com/CookieCyboid/status/1070807283305713665"]
+  - description: Show more replies.
+    selector: button.ThreadedConversation-showMoreThreadsButton
+    urls: ["https://twitter.com/fuglydug/status/1172160128101076995"]
 ---
 match: ^disqus\.com$
 selector:
-  - description: load more comments
+  - description: Load more comments.
     selector: a.load-more__button
     multi: True
 ---
-match: ^(www|np)\.reddit\.com$
+# new layout
+match: ^www\.reddit\.com$
 selector:
-  - description: show more comments, reddit’s javascript ignores events if too frequent
-    selector: span.morecomments a
+  - description: Show more comments.
+    selector: div[id^=moreComments-] > div > p
+    # reddit’s javascript ignores events if too frequent
     throttle: 500
-    # disabled: No idea why it is not working. The selector is fine.
-    #urls: ["https://www.reddit.com/r/funny/comments/a21rxz/well_this_was_a_highlight_of_my_day/"]
+    urls: ["https://www.reddit.com/r/subredditcancer/comments/b2b80f/we_are_moderators_of_rwatchpeopledie_amaa_just/"]
 ---
-match: ^www\.instagram\.com$
+# old layout
+match: ^(old|np)\.reddit\.com$
 selector:
-  - description: load more comments
-    selector: article div ul li button[type=button]
-    multi: True
-    urls: ["https://www.instagram.com/p/BqvAm_XnmdJ/"]
+  - description: Show more comments.
+    selector: span.morecomments a
+    # reddit’s javascript ignores events if too frequent
+    throttle: 500
+    urls: ["https://old.reddit.com/r/subredditcancer/comments/b2b80f/we_are_moderators_of_rwatchpeopledie_amaa_just/"]
 ---
 match: ^www\.youtube\.com$
 selector:
-  - description: expand comment thread
-    selector: ytd-comment-thread-renderer div.more-button
+  - description: Expand single comment.
+    selector: ytd-comment-thread-renderer span[slot=more-button]
     urls: ["https://www.youtube.com/watch?v=udtFqQuBFSc"]
+  - description: Show more comment thread replies.
+    selector: div.ytd-comment-replies-renderer > yt-next-continuation > paper-button
+    urls: ["https://www.youtube.com/watch?v=Lov0T3eXI2k"]
+    multi: True
 ---
 match: ^www\.patreon\.com$
 selector:
-  - description: load more content
-    # this selector is so long, because there are no stable css classes
-    selector: div.col-xs-12 > div > div > div > div[display="flex"] > div > button[tabindex="0"][color="tertiary"][type="button"]
-    urls: ["https://www.patreon.com/nkjemisin"]
-  - description: load more comments
-    selector: div[display=flex] div[display=block] a[color="dark"][role="button"][tabindex="0"]
+  - description: Load more comments.
+    selector: div[data-tag=post-card] button[data-tag=loadMoreCommentsCta]
     urls: ["https://www.patreon.com/posts/what-im-on-22124040"]
-  - description: load more replies
-    selector: div > a[scale="0"][color=blue][size="1"]
 ---
-match: ^(www\.)?gab\.ai$
+match: ^(www\.)?gab\.com$
 selector:
-  - description: more replies
-    selector: post-detail post-comment .post-comment__replies__count a
-    urls: ["https://gab.ai/gab/posts/40014689"]
-  - description: more comments
-    selector: post-detail .post-comment-list__loading a
-    urls: ["https://gab.ai/gab/posts/41804462"]
-  - description: more posts
-    selector: post-list a.post-list__load-more
+  - description: Load more posts.
+    selector: div.item-list[role=feed] button.load-more
     multi: True
-    urls: ["https://gab.ai/gab"]
+    urls: ["https://gab.com/gab"]
 ---
 match: ^(www\.)?github\.com$
 selector:
-  - description: show hidden issue items
+  - description: Show hidden issue items.
     urls: ["https://github.com/dominictarr/event-stream/issues/116"]
     selector: div#discussion_bucket form.ajax-pagination-form button.ajax-pagination-btn
 ---
 match: ^www\.gamasutra\.com$
 selector:
-    - description: Load more comments
+    - description: Load more comments.
       urls: ["http://www.gamasutra.com/blogs/RaminShokrizade/20130626/194933/The_Top_F2P_Monetization_Tricks.php"]
       selector: div#dynamiccomments div.viewTopCmts a
-
+---
+match: ^(www\.)?steamcommunity\.com$
+selector:
+    - description: Load more content.
+      urls: ["https://steamcommunity.com/app/252950/reviews/?p=1&browsefilter=toprated&filterLanguage=all"]
+      selector: "#GetMoreContentBtn a"
+      multi: True
+---
+match: ^imgur\.com$
+selector:
+    - description: Load more images of an album.
+      urls: ["https://imgur.com/a/JG1yc"]
+      selector: div.js-post-truncated a.post-loadall
+    - description: Expand all comments. For snapshots.
+      urls: ["https://imgur.com/a/JG1yc"]
+      selector: div.comments-info span.comments-expand
+    - description: Show bad replies. for snapshots.
+      urls: ["https://imgur.com/gallery/jRzMfRG"]
+      selector: div#comments div.bad-captions a.link
+---
+match: ^(www\.)?vimeo\.com$
+selector:
+    - description: Load more videos on profile page.
+      urls: ["https://vimeo.com/dsam4a"]
+      selector: div.profile_main div.profile-load-more__button--wrapper button
+#    XXX: this works when using a non-headless browser, but does not otherwise
+#    - description: Expand video comments
+#      urls: ["https://vimeo.com/22439234"]
+#      selector: section#comments button.iris_comment-more
+#      multi: True
diff --git a/crocoite/data/cookies.txt b/crocoite/data/cookies.txt
new file mode 100644
index 0000000..6ac62c3
--- /dev/null
+++ b/crocoite/data/cookies.txt
@@ -0,0 +1,9 @@
+# Default cookies for crocoite. This file does *not* use Netscape’s cookie
+# file format. Lines are expected to be in Set-Cookie format.
+# And this line is a comment.
+
+# Reddit:
+# skip over 18 prompt
+over18=1; Domain=www.reddit.com
+# skip quarantined subreddit prompt
+_options={%22pref_quarantine_optin%22:true}; Domain=www.reddit.com
diff --git a/crocoite/data/extract-links.js b/crocoite/data/extract-links.js
index 4d1a3d0..5a4f9f0 100644
--- a/crocoite/data/extract-links.js
+++ b/crocoite/data/extract-links.js
@@ -25,11 +25,26 @@ function isClickable (o) {
 }
 /* --- end copy&paste */
 
-let x = document.body.querySelectorAll('a[href]');
 let ret = [];
+['a[href]', 'area[href]'].forEach (function (s) {
+	let x = document.querySelectorAll(s);
+	for (let i=0; i < x.length; i++) {
+		if (isClickable (x[i])) {
+			ret.push (x[i].href);
+		}
+	}
+});
+
+/* If Chrome loads plain-text documents it’ll wrap them into <pre>. Check those
+ * for links as well, assuming the whole line is a link (i.e. list of links). */
+let x = document.querySelectorAll ('body > pre');
 for (let i=0; i < x.length; i++) {
-	if (isClickable (x[i])) {
-		ret.push (x[i].href);
+	if (isVisible (x[i])) {
+		x[i].innerText.split ('\n').forEach (function (s) {
+			if (s.match ('^https?://')) {
+				ret.push (s);
+			}
+		});
 	}
 }
 return ret; /* immediately return results, for use with Runtime.evaluate() */
diff --git a/crocoite/data/screenshot.js b/crocoite/data/screenshot.js
new file mode 100644
index 0000000..a9a41e1
--- /dev/null
+++ b/crocoite/data/screenshot.js
@@ -0,0 +1,20 @@
+/* Find and scrollable full-screen elements and return their actual size
+ */
+(function () {
+/* limit the number of elements queried */
+let elem = document.querySelectorAll ('body > div');
+let ret = [];
+for (let i = 0; i < elem.length; i++) {
+	let e = elem[i];
+	let s = window.getComputedStyle (e);
+	if (s.getPropertyValue ('position') == 'fixed' &&
+			s.getPropertyValue ('overflow') == 'auto' &&
+			s.getPropertyValue ('left') == '0px' &&
+			s.getPropertyValue ('right') == '0px' &&
+			s.getPropertyValue ('top') == '0px' &&
+			s.getPropertyValue ('bottom') == '0px') {
+		ret.push (e.scrollHeight);
+	}
+}
+return ret; /* immediately return results, for use with Runtime.evaluate() */
+})();
diff --git a/crocoite/devtools.py b/crocoite/devtools.py
index b071d2e..8b5c69d 100644
--- a/crocoite/devtools.py
+++ b/crocoite/devtools.py
@@ -25,7 +25,12 @@ Communication with Google Chrome through its DevTools protocol.
 import json, asyncio, logging, os
 from tempfile import mkdtemp
 import shutil
+from http.cookies import Morsel
+
 import aiohttp, websockets
+from yarl import URL
+
+from .util import StrJsonEncoder
 
 logger = logging.getLogger (__name__)
 
@@ -37,18 +42,17 @@ class Browser:
     Destroyed upon exit.
     """
 
-    __slots__ = ('session', 'url', 'tab', 'loop')
+    __slots__ = ('session', 'url', 'tab')
 
-    def __init__ (self, url, loop=None):
-        self.url = url
+    def __init__ (self, url):
+        self.url = URL (url)
         self.session = None
         self.tab = None
-        self.loop = loop
 
     async def __aiter__ (self):
         """ List all tabs """
-        async with aiohttp.ClientSession (loop=self.loop) as session:
-            async with session.get ('{}/json/list'.format (self.url)) as r:
+        async with aiohttp.ClientSession () as session:
+            async with session.get (self.url.with_path ('/json/list')) as r:
                 resp = await r.json ()
                 for tab in resp:
                     if tab['type'] == 'page':
@@ -58,22 +62,35 @@ class Browser:
         """ Create tab """
         assert self.tab is None
         assert self.session is None
-        self.session = aiohttp.ClientSession (loop=self.loop)
-        async with self.session.get ('{}/json/new'.format (self.url)) as r:
+        self.session = aiohttp.ClientSession ()
+        async with self.session.get (self.url.with_path ('/json/new')) as r:
             resp = await r.json ()
             self.tab = await Tab.create (**resp)
             return self.tab
 
-    async def __aexit__ (self, *args):
+    async def __aexit__ (self, excType, excValue, traceback):
         assert self.tab is not None
         assert self.session is not None
+
         await self.tab.close ()
-        async with self.session.get ('{}/json/close/{}'.format (self.url, self.tab.id)) as r:
-            resp = await r.text ()
-            assert resp == 'Target is closing'
+
+        try:
+            async with self.session.get (self.url.with_path (f'/json/close/{self.tab.id}')) as r:
+                resp = await r.text ()
+                assert resp == 'Target is closing'
+        except aiohttp.client_exceptions.ClientConnectorError:
+            # oh boy, the whole browser crashed instead
+            if excType is Crashed:
+                # exception is reraised by `return False`
+                pass
+            else:
+                # this one is more important
+                raise
+
         self.tab = None
         await self.session.close ()
         self.session = None
+
         return False
 
 class TabFunction:
@@ -101,13 +118,13 @@ class TabFunction:
         return hash (self.name)
 
     def __getattr__ (self, k):
-        return TabFunction ('{}.{}'.format (self.name, k), self.tab)
+        return TabFunction (f'{self.name}.{k}', self.tab)
 
     async def __call__ (self, **kwargs):
         return await self.tab (self.name, **kwargs)
 
     def __repr__ (self):
-        return '<TabFunction {}>'.format (self.name)
+        return f'<TabFunction {self.name}>'
 
 class TabException (Exception):
     pass
@@ -154,8 +171,8 @@ class Tab:
         self.msgid += 1
         message = {'method': method, 'params': kwargs, 'id': msgid}
         t = self.transactions[msgid] = {'event': asyncio.Event (), 'result': None}
-        logger.debug ('← {}'.format (message))
-        await self.ws.send (json.dumps (message))
+        logger.debug (f'← {message}')
+        await self.ws.send (json.dumps (message, cls=StrJsonEncoder))
         await t['event'].wait ()
         ret = t['result']
         del self.transactions[msgid]
@@ -189,7 +206,7 @@ class Tab:
                 # right now we cannot recover from this
                 await markCrashed (e)
                 break
-            logger.debug ('→ {}'.format (msg))
+            logger.debug (f'→ {msg}')
             if 'id' in msg:
                 msgid = msg['id']
                 t = self.transactions.get (msgid, None)
@@ -266,11 +283,11 @@ class Process:
 
     async def __aenter__ (self):
         assert self.p is None
-        self.userDataDir = mkdtemp ()
+        self.userDataDir = mkdtemp (prefix=__package__ + '-chrome-userdata-')
         # see https://github.com/GoogleChrome/chrome-launcher/blob/master/docs/chrome-flags-for-tools.md
         args = [self.binary,
                 '--window-size={},{}'.format (*self.windowSize),
-                '--user-data-dir={}'.format (self.userDataDir), # use temporory user dir
+                f'--user-data-dir={self.userDataDir}', # use temporory user dir
                 '--no-default-browser-check',
                 '--no-first-run', # don’t show first run screen
                 '--disable-breakpad', # no error reports
@@ -315,12 +332,26 @@ class Process:
         if port is None:
             raise Exception ('Chrome died on us.')
 
-        return 'http://localhost:{}'.format (port)
+        return URL.build(scheme='http', host='localhost', port=port)
 
     async def __aexit__ (self, *exc):
-        self.p.terminate ()
-        await self.p.wait ()
-        shutil.rmtree (self.userDataDir)
+        try:
+            self.p.terminate ()
+            await self.p.wait ()
+        except ProcessLookupError:
+            # ok, fine, dead already
+            pass
+
+        # Try to delete the temporary directory multiple times. It looks like
+        # Chrome will change files in there even after it exited (i.e. .wait()
+        # returned). Very strange.
+        for i in range (5):
+            try:
+                shutil.rmtree (self.userDataDir)
+                break
+            except:
+                await asyncio.sleep (0.2)
+
         self.p = None
         return False
 
@@ -328,7 +359,7 @@ class Passthrough:
     __slots__ = ('url', )
 
     def __init__ (self, url):
-        self.url = url
+        self.url = URL (url)
 
     async def __aenter__ (self):
         return self.url
@@ -336,3 +367,26 @@ class Passthrough:
     async def __aexit__ (self, *exc):
         return False
 
+def toCookieParam (m):
+    """
+    Convert Python’s http.cookies.Morsel to Chrome’s CookieParam, see
+    https://chromedevtools.github.io/devtools-protocol/1-3/Network#type-CookieParam
+    """
+
+    assert isinstance (m, Morsel)
+
+    out = {'name': m.key, 'value': m.value}
+
+    # unsupported by chrome
+    for k in ('max-age', 'comment', 'version'):
+        if m[k]:
+            raise ValueError (f'Unsupported cookie attribute {k} set, cannot convert')
+
+    for mname, cname in [('expires', None), ('path', None), ('domain', None), ('secure', None), ('httponly', 'httpOnly')]:
+        value = m[mname]
+        if value:
+            cname = cname or mname
+            out[cname] = value
+
+    return out
+
diff --git a/crocoite/html.py b/crocoite/html.py
index fec9760..30f6ca5 100644
--- a/crocoite/html.py
+++ b/crocoite/html.py
@@ -107,6 +107,8 @@ eventAttributes = {'onabort',
         'onvolumechange',
         'onwaiting'}
 
+default_namespace = constants.namespaces["html"]
+
 class ChromeTreeWalker (TreeWalker):
     """
     Recursive html5lib TreeWalker for Google Chrome method DOM.getDocument
@@ -122,11 +124,14 @@ class ChromeTreeWalker (TreeWalker):
             elif name == '#document':
                 for child in node.get ('children', []):
                     yield from self.recurse (child)
+            elif name == '#cdata-section':
+                # html5lib cannot generate cdata, so we’re faking it by using
+                # an empty tag
+                yield from self.emptyTag (default_namespace,
+                        '![CDATA[' + node['nodeValue'] + ']]', {})
             else:
-                assert False, name
+                assert False, (name, node)
         else:
-            default_namespace = constants.namespaces["html"]
-
             attributes = node.get ('attributes', [])
             convertedAttr = {}
             for i in range (0, len (attributes), 2):
diff --git a/crocoite/irc.py b/crocoite/irc.py
index 99485e4..d9c0634 100644
--- a/crocoite/irc.py
+++ b/crocoite/irc.py
@@ -22,16 +22,19 @@
 IRC bot “chromebot”
 """
 
-import asyncio, argparse, uuid, json, tempfile
+import asyncio, argparse, json, tempfile, time, random, os, shlex
 from datetime import datetime
 from urllib.parse import urlsplit
-from enum import IntEnum, Enum
+from enum import IntEnum, unique
 from collections import defaultdict
 from abc import abstractmethod
 from functools import wraps
 import bottom
 import websockets
 
+from .util import StrJsonEncoder
+from .cli import cookie
+
 ### helper functions ###
 def prettyTimeDelta (seconds):
     """
@@ -53,7 +56,7 @@ def prettyBytes (b):
     while b >= 1024 and len (prefixes) > 1:
         b /= 1024
         prefixes.pop (0)
-    return '{:.1f} {}'.format (b, prefixes[0])
+    return f'{b:.1f} {prefixes[0]}'
 
 def isValidUrl (s):
     url = urlsplit (s)
@@ -84,13 +87,45 @@ class Status(IntEnum):
     aborted = 3
     finished = 4
 
+# see https://arxiv.org/html/0901.4016 on how to build proquints (human
+# pronouncable unique ids)
+toConsonant = 'bdfghjklmnprstvz'
+toVowel = 'aiou'
+
+def u16ToQuint (v):
+    """ Transform a 16 bit unsigned integer into a single quint """
+    assert 0 <= v < 2**16
+    # quints are “big-endian”
+    return ''.join ([
+            toConsonant[(v>>(4+2+4+2))&0xf],
+            toVowel[(v>>(4+2+4))&0x3],
+            toConsonant[(v>>(4+2))&0xf],
+            toVowel[(v>>4)&0x3],
+            toConsonant[(v>>0)&0xf],
+            ])
+
+def uintToQuint (v, length=2):
+    """ Turn any integer into a proquint with fixed length """
+    assert 0 <= v < 2**(length*16)
+
+    return '-'.join (reversed ([u16ToQuint ((v>>(x*16))&0xffff) for x in range (length)]))
+
+def makeJobId ():
+    """ Create job id from time and randomness source """
+    # allocate 48 bits for the time (in milliseconds) and add 16 random bits
+    # at the end (just to be sure) for a total of 64 bits. Should be enough to
+    # avoid collisions.
+    randbits = 16
+    stamp = (int (time.time ()*1000) << randbits) | random.randint (0, 2**randbits-1)
+    return uintToQuint (stamp, 4)
+
 class Job:
     """ Archival job """
 
     __slots__ = ('id', 'stats', 'rstats', 'started', 'finished', 'nick', 'status', 'process', 'url')
 
     def __init__ (self, url, nick):
-        self.id = str (uuid.uuid4 ())
+        self.id = makeJobId ()
         self.stats = {}
         self.rstats = {}
         self.started = datetime.utcnow ()
@@ -104,32 +139,40 @@ class Job:
     def formatStatus (self):
         stats = self.stats
         rstats = self.rstats
-        return '{} ({}) {}. {} pages finished, {} pending; {} crashed, {} requests, {} failed, {} received.'.format (
-                self.url,
-                self.id,
-                self.status.name,
-                rstats.get ('have', 0),
-                rstats.get ('pending', 0),
-                stats.get ('crashed', 0),
-                stats.get ('requests', 0),
-                stats.get ('failed', 0),
-                prettyBytes (stats.get ('bytesRcv', 0)))
-
-class NickMode(Enum):
-    operator = '@'
-    voice = '+'
+        return (f"{self.url} ({self.id}) {self.status.name}. "
+                f"{rstats.get ('have', 0)} pages finished, "
+                f"{rstats.get ('pending', 0)} pending; "
+                f"{stats.get ('crashed', 0)} crashed, "
+                f"{stats.get ('requests', 0)} requests, "
+                f"{stats.get ('failed', 0)} failed, "
+                f"{prettyBytes (stats.get ('bytesRcv', 0))} received.")
+
+@unique
+class NickMode(IntEnum):
+    # the actual numbers don’t matter, but their order must be strictly
+    # increasing (with priviledge level)
+    operator = 100
+    voice = 10
 
     @classmethod
     def fromMode (cls, mode):
         return {'v': cls.voice, 'o': cls.operator}[mode]
 
+    @classmethod
+    def fromNickPrefix (cls, mode):
+        return {'@': cls.operator, '+': cls.voice}[mode]
+
+    @property
+    def human (self):
+        return {self.operator: 'operator', self.voice: 'voice'}[self]
+
 class User:
     """ IRC user """
     __slots__ = ('name', 'modes')
 
-    def __init__ (self, name, modes=set ()):
+    def __init__ (self, name, modes=None):
         self.name = name
-        self.modes = modes
+        self.modes = modes or set ()
 
     def __eq__ (self, b):
         return self.name == b.name
@@ -138,15 +181,21 @@ class User:
         return hash (self.name)
 
     def __repr__ (self):
-        return '<User {} {}>'.format (self.name, self.modes)
+        return f'<User {self.name} {self.modes}>'
+
+    def hasPriv (self, p):
+        if p is None:
+            return True
+        else:
+            return self.modes and max (self.modes) >= p
 
     @classmethod
     def fromName (cls, name):
         """ Get mode and name from NAMES command """
         try:
-            modes = {NickMode(name[0])}
+            modes = {NickMode.fromNickPrefix (name[0])}
             name = name[1:]
-        except ValueError:
+        except KeyError:
             modes = set ()
         return cls (name, modes)
 
@@ -159,7 +208,8 @@ class ReplyContext:
         self.user = user
 
     def __call__ (self, message):
-        self.client.send ('PRIVMSG', target=self.target, message='{}: {}'.format (self.user.name, message))
+        self.client.send ('PRIVMSG', target=self.target,
+                message=f'{self.user.name}: {message}')
 
 class RefCountEvent:
     """
@@ -200,9 +250,9 @@ class ArgparseBot (bottom.Client):
 
     __slots__ = ('channels', 'nick', 'parser', 'users', '_quit')
 
-    def __init__ (self, host, port, ssl, nick, logger, channels=[], loop=None):
+    def __init__ (self, host, port, ssl, nick, logger, channels=None, loop=None):
         super().__init__ (host=host, port=port, ssl=ssl, loop=loop)
-        self.channels = channels
+        self.channels = channels or []
         self.nick = nick
         # map channel -> nick -> user
         self.users = defaultdict (dict)
@@ -259,8 +309,13 @@ class ArgparseBot (bottom.Client):
             self.send ('JOIN', channel=c)
             # no need for NAMES here, server sends this automatically
 
-    async def onNameReply (self, target, channel_type, channel, users, **kwargs):
-        self.users[channel] = dict (map (lambda x: (x.name, x), map (User.fromName, users)))
+    async def onNameReply (self, channel, users, **kwargs):
+        # channels may be too big for a single message
+        addusers = dict (map (lambda x: (x.name, x), map (User.fromName, users)))
+        if channel not in self.users:
+            self.users[channel] = addusers
+        else:
+            self.users[channel].update (addusers)
 
     @staticmethod
     def parseMode (mode):
@@ -274,7 +329,7 @@ class ArgparseBot (bottom.Client):
                 ret.append ((action, c))
         return ret
 
-    async def onMode (self, nick, user, host, channel, modes, params, **kwargs):
+    async def onMode (self, channel, modes, params, **kwargs):
         if channel not in self.channels:
             return
 
@@ -290,7 +345,7 @@ class ArgparseBot (bottom.Client):
                 # unknown mode, ignore
                 pass
 
-    async def onPart (self, nick, user, host, message, channel, **kwargs):
+    async def onPart (self, nick, channel, **kwargs):
         if channel not in self.channels:
             return
 
@@ -312,23 +367,27 @@ class ArgparseBot (bottom.Client):
 
     async def onMessage (self, nick, target, message, **kwargs):
         """ Message received """
-        if target in self.channels and message.startswith (self.nick):
+        msgPrefix = self.nick + ':'
+        if target in self.channels and message.startswith (msgPrefix):
             user = self.users[target].get (nick, User (nick))
             reply = ReplyContext (client=self, target=target, user=user)
 
-            # channel message that starts with our nick
-            command = message.split (' ')[1:]
+            # shlex.split supports quoting arguments, which str.split() does not
+            command = shlex.split (message[len (msgPrefix):])
             try:
                 args = self.parser.parse_args (command)
             except Exception as e:
-                reply ('{} -- {}'.format (e.args[1], e.args[0].format_usage ()))
+                reply (f'{e.args[1]} -- {e.args[0].format_usage ()}')
                 return
-            if not args:
-                reply ('Sorry, I don’t understand {}'.format (command))
+            if not args or not hasattr (args, 'func'):
+                reply (f'Sorry, I don’t understand {command}')
                 return
 
+            minPriv = getattr (args, 'minPriv', None)
             if self._quit.armed and not getattr (args, 'allowOnShutdown', False):
                 reply ('Sorry, I’m shutting down and cannot accept your request right now.')
+            elif not user.hasPriv (minPriv):
+                reply (f'Sorry, you need the privilege {minPriv.human} to use this command.')
             else:
                 with self._quit:
                     await args.func (user=user, args=args, reply=reply)
@@ -336,23 +395,14 @@ class ArgparseBot (bottom.Client):
     async def onDisconnect (self, **kwargs):
         """ Auto-reconnect """
         self.logger.info ('disconnect', uuid='4c74b2c8-2403-4921-879d-2279ad85db72')
-        if not self._quit.armed:
-            await asyncio.sleep (10, loop=self.loop)
-            self.logger.info ('reconnect', uuid='c53555cb-e1a4-4b69-b1c9-3320269c19d7')
-            await self.connect ()
-
-def voice (func):
-    """ Calling user must have voice or ops """
-    @wraps (func)
-    async def inner (self, *args, **kwargs):
-        user = kwargs.get ('user')
-        reply = kwargs.get ('reply')
-        if not user.modes.intersection ({NickMode.operator, NickMode.voice}):
-            reply ('Sorry, you must have voice to use this command.')
-        else:
-            ret = await func (self, *args, **kwargs)
-            return ret
-    return inner
+        while True:
+            if not self._quit.armed:
+                await asyncio.sleep (10, loop=self.loop)
+                self.logger.info ('reconnect', uuid='c53555cb-e1a4-4b69-b1c9-3320269c19d7')
+                try:
+                    await self.connect ()
+                finally:
+                    break
 
 def jobExists (func):
     """ Chromebot job exists """
@@ -363,38 +413,45 @@ def jobExists (func):
         reply = kwargs.get ('reply')
         j = self.jobs.get (args.id, None)
         if not j:
-            reply ('Job {} is unknown'.format (args.id))
+            reply (f'Job {args.id} is unknown')
         else:
             ret = await func (self, job=j, **kwargs)
             return ret
     return inner
 
 class Chromebot (ArgparseBot):
-    __slots__ = ('jobs', 'tempdir', 'destdir', 'processLimit')
+    __slots__ = ('jobs', 'tempdir', 'destdir', 'processLimit', 'blacklist', 'needVoice')
+
+    def __init__ (self, host, port, ssl, nick, logger, channels=None,
+            tempdir=None, destdir='.', processLimit=1,
+            blacklist={}, needVoice=False, loop=None):
+        self.needVoice = needVoice
 
-    def __init__ (self, host, port, ssl, nick, logger, channels=[],
-            tempdir=tempfile.gettempdir(), destdir='.', processLimit=1,
-            loop=None):
         super().__init__ (host=host, port=port, ssl=ssl, nick=nick,
                 logger=logger, channels=channels, loop=loop)
 
         self.jobs = {}
-        self.tempdir = tempdir
+        self.tempdir = tempdir or tempfile.gettempdir()
         self.destdir = destdir
         self.processLimit = asyncio.Semaphore (processLimit)
+        self.blacklist = blacklist
 
     def getParser (self):
         parser = NonExitingArgumentParser (prog=self.nick + ': ', add_help=False)
         subparsers = parser.add_subparsers(help='Sub-commands')
 
         archiveparser = subparsers.add_parser('a', help='Archive a site', add_help=False)
-        #archiveparser.add_argument('--timeout', default=1*60*60, type=int, help='Maximum time for archival', metavar='SEC', choices=[60, 1*60*60, 2*60*60])
-        #archiveparser.add_argument('--idle-timeout', default=10, type=int, help='Maximum idle seconds (i.e. no requests)', dest='idleTimeout', metavar='SEC', choices=[1, 10, 20, 30, 60])
-        #archiveparser.add_argument('--max-body-size', default=None, type=int, dest='maxBodySize', help='Max body size', metavar='BYTES', choices=[1*1024*1024, 10*1024*1024, 100*1024*1024])
         archiveparser.add_argument('--concurrency', '-j', default=1, type=int, help='Parallel workers for this job', choices=range (1, 5))
         archiveparser.add_argument('--recursive', '-r', help='Enable recursion', choices=['0', '1', 'prefix'], default='0')
+        archiveparser.add_argument('--insecure', '-k',
+                help='Disable certificate checking', action='store_true')
+        # parsing the cookie here, so we can give an early feedback without
+        # waiting for the job to crash on invalid arguments.
+        archiveparser.add_argument('--cookie', '-b', type=cookie,
+                help='Add a cookie', action='append', default=[])
         archiveparser.add_argument('url', help='Website URL', type=isValidUrl, metavar='URL')
-        archiveparser.set_defaults (func=self.handleArchive)
+        archiveparser.set_defaults (func=self.handleArchive,
+                minPriv=NickMode.voice if self.needVoice else None)
 
         statusparser = subparsers.add_parser ('s', help='Get job status', add_help=False)
         statusparser.add_argument('id', help='Job id', metavar='UUID')
@@ -402,31 +459,70 @@ class Chromebot (ArgparseBot):
 
         abortparser = subparsers.add_parser ('r', help='Revoke/abort job', add_help=False)
         abortparser.add_argument('id', help='Job id', metavar='UUID')
-        abortparser.set_defaults (func=self.handleAbort, allowOnShutdown=True)
+        abortparser.set_defaults (func=self.handleAbort, allowOnShutdown=True,
+                minPriv=NickMode.voice if self.needVoice else None)
 
         return parser
 
-    @voice
+    def isBlacklisted (self, url):
+        for k, v in self.blacklist.items():
+            if k.match (url):
+                return v
+        return False
+
     async def handleArchive (self, user, args, reply):
         """ Handle the archive command """
 
-        j = Job (args.url, user.name)
-        assert j.id not in self.jobs, 'duplicate job id'
+        msg = self.isBlacklisted (args.url)
+        if msg:
+            reply (f'{args.url} cannot be queued: {msg}')
+            return
+
+        # make sure the job id is unique. Since ids are time-based we can just
+        # wait.
+        while True:
+            j = Job (args.url, user.name)
+            if j.id not in self.jobs:
+                break
+            time.sleep (0.01)
         self.jobs[j.id] = j
 
         logger = self.logger.bind (job=j.id)
 
-        cmdline = ['crocoite-recursive', args.url, '--tempdir', self.tempdir,
-                '--prefix', j.id + '-{host}-{date}-', '--policy',
-                args.recursive, '--concurrency', str (args.concurrency),
-                self.destdir]
-
         showargs = {
                 'recursive': args.recursive,
                 'concurrency': args.concurrency,
                 }
+        if args.insecure:
+            showargs['insecure'] = args.insecure
+        warcinfo = {'chromebot': {
+                'jobid': j.id,
+                'user': user.name,
+                'queued': j.started,
+                'url': args.url,
+                'recursive': args.recursive,
+                'concurrency': args.concurrency,
+                }}
+        grabCmd = ['crocoite-single']
+        # prefix warcinfo with !, so it won’t get expanded
+        grabCmd.extend (['--warcinfo',
+                '!' + json.dumps (warcinfo, cls=StrJsonEncoder)])
+        for v in args.cookie:
+            grabCmd.extend (['--cookie', v.OutputString ()])
+        if args.insecure:
+            grabCmd.append ('--insecure')
+        grabCmd.extend (['{url}', '{dest}'])
+        cmdline = ['crocoite',
+                '--tempdir', self.tempdir,
+                '--recursion', args.recursive,
+                '--concurrency', str (args.concurrency),
+                args.url,
+                os.path.join (self.destdir,
+                        j.id + '-{host}-{date}-{seqnum}.warc.gz'),
+                '--'] + grabCmd
+
         strargs = ', '.join (map (lambda x: '{}={}'.format (*x), showargs.items ()))
-        reply ('{} has been queued as {} with {}'.format (args.url, j.id, strargs))
+        reply (f'{args.url} has been queued as {j.id} with {strargs}')
         logger.info ('queue', user=user.name, url=args.url, cmdline=cmdline,
                 uuid='36cc34a6-061b-4cc5-84a9-4ab6552c8d75')
 
@@ -437,7 +533,7 @@ class Chromebot (ArgparseBot):
                         stdout=asyncio.subprocess.PIPE,
                         stderr=asyncio.subprocess.DEVNULL,
                         stdin=asyncio.subprocess.DEVNULL,
-                        start_new_session=True)
+                        start_new_session=True, limit=100*1024*1024)
                 while True:
                     data = await j.process.stdout.readline ()
                     if not data:
@@ -477,7 +573,6 @@ class Chromebot (ArgparseBot):
         rstats = job.rstats
         reply (job.formatStatus ())
 
-    @voice
     @jobExists
     async def handleAbort (self, user, args, reply, job):
         """ Handle abort command """
@@ -541,7 +636,11 @@ class Dashboard:
         if not buf:
             return
 
-        data = json.loads (buf)
+        try:
+            data = json.loads (buf)
+        except json.decoder.JSONDecodeError:
+            # ignore invalid
+            return
         msgid = data['uuid']
 
         if msgid in self.ignoreMsgid:
@@ -554,9 +653,8 @@ class Dashboard:
         elif msgid == '5c0f9a11-dcd8-4182-a60f-54f4d3ab3687':
             nesteddata = data['data']
             nestedmsgid = nesteddata['uuid']
-            if nestedmsgid == '1680f384-744c-4b8a-815b-7346e632e8db':
+            if nestedmsgid == 'd1288fbe-8bae-42c8-af8c-f2fa8b41794f':
                 del nesteddata['command']
-                del nesteddata['destfile']
             
         buf = json.dumps (data)
         for c in self.clients:
diff --git a/crocoite/logger.py b/crocoite/logger.py
index cddc42d..ac389ca 100644
--- a/crocoite/logger.py
+++ b/crocoite/logger.py
@@ -34,6 +34,8 @@ from enum import IntEnum
 
 from pytz import utc
 
+from .util import StrJsonEncoder
+
 class Level(IntEnum):
     DEBUG = 0
     INFO = 1
@@ -41,9 +43,9 @@ class Level(IntEnum):
     ERROR = 3
 
 class Logger:
-    def __init__ (self, consumer=[], bindings={}):
-        self.bindings = bindings
-        self.consumer = consumer
+    def __init__ (self, consumer=None, bindings=None):
+        self.bindings = bindings or {}
+        self.consumer = consumer or []
 
     def __call__ (self, level, *args, **kwargs):
         if not isinstance (level, Level):
@@ -102,24 +104,13 @@ class PrintConsumer (Consumer):
         sys.stderr.flush ()
         return kwargs
 
-class JsonEncoder (json.JSONEncoder):
-    def default (self, obj):
-        if isinstance (obj, datetime):
-            return obj.isoformat ()
-
-        # make sure serialization always succeeds
-        try:
-            return json.JSONEncoder.default(self, obj)
-        except TypeError:
-            return str (obj)
-
 class JsonPrintConsumer (Consumer):
-    def __init__ (self, minLevel=Level.INFO):
+    def __init__ (self, minLevel=Level.DEBUG):
         self.minLevel = minLevel
 
     def __call__ (self, **kwargs):
         if kwargs['level'] >= self.minLevel:
-            json.dump (kwargs, sys.stdout, cls=JsonEncoder)
+            json.dump (kwargs, sys.stdout, cls=StrJsonEncoder)
             sys.stdout.write ('\n')
             sys.stdout.flush ()
         return kwargs
@@ -130,12 +121,12 @@ class DatetimeConsumer (Consumer):
         return kwargs
 
 class WarcHandlerConsumer (Consumer):
-    def __init__ (self, warc, minLevel=Level.INFO):
+    def __init__ (self, warc, minLevel=Level.DEBUG):
         self.warc = warc
         self.minLevel = minLevel
 
     def __call__ (self, **kwargs):
         if kwargs['level'] >= self.minLevel:
-            self.warc._writeLog (json.dumps (kwargs, cls=JsonEncoder))
+            self.warc._writeLog (json.dumps (kwargs, cls=StrJsonEncoder))
         return kwargs
 
diff --git a/crocoite/test_behavior.py b/crocoite/test_behavior.py
index 280b35b..1efea08 100644
--- a/crocoite/test_behavior.py
+++ b/crocoite/test_behavior.py
@@ -18,19 +18,24 @@
 # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 # THE SOFTWARE.
 
-import asyncio, os, yaml, re
-from urllib.parse import urlparse
+import asyncio, os, yaml, re, math, struct
 from functools import partial
+from operator import attrgetter
+
 import pytest
+from yarl import URL
+from aiohttp import web
 
 import pkg_resources
 from .logger import Logger
 from .devtools import Process
-from .behavior import Scroll, Behavior
-from .controller import SinglePageController
+from .behavior import Scroll, Behavior, ExtractLinks, ExtractLinksEvent, Crash, \
+        Screenshot, ScreenshotEvent, DomSnapshot, DomSnapshotEvent, mapOrIgnore
+from .controller import SinglePageController, EventHandler, ControllerSettings
+from .devtools import Crashed
 
 with pkg_resources.resource_stream (__name__, os.path.join ('data', 'click.yaml')) as fd:
-    sites = list (yaml.load_all (fd))
+    sites = list (yaml.safe_load_all (fd))
 clickParam = []
 for o in sites:
     for s in o['selector']:
@@ -67,7 +72,7 @@ class ClickTester (Behavior):
 #            assert any (map (lambda x: x['type'] == 'click', listeners)), listeners
 
         return
-        yield
+        yield # pragma: no cover
 
 @pytest.mark.parametrize("url,selector", clickParam)
 @pytest.mark.asyncio
@@ -77,8 +82,10 @@ async def test_click_selectors (url, selector):
     Make sure the CSS selector exists on an example url
     """
     logger = Logger ()
+    settings = ControllerSettings (idleTimeout=5, timeout=10)
     # Some selectors are loaded dynamically and require scrolling
     controller = SinglePageController (url=url, logger=logger,
+            settings=settings,
             service=Process (),
             behavior=[Scroll, partial(ClickTester, selector=selector)])
     await controller.run ()
@@ -87,12 +94,173 @@ matchParam = []
 for o in sites:
     for s in o['selector']:
         for u in s.get ('urls', []):
-            matchParam.append ((o['match'], u))
+            matchParam.append ((o['match'], URL (u)))
 
 @pytest.mark.parametrize("match,url", matchParam)
 @pytest.mark.asyncio
 async def test_click_match (match, url):
     """ Test urls must match """
-    host = urlparse (url).netloc
-    assert re.match (match, host, re.I)
+    # keep this aligned with click.js
+    assert re.match (match, url.host, re.I)
+
+
+class AccumHandler (EventHandler):
+    """ Test adapter that accumulates all incoming items """
+    __slots__ = ('data')
+
+    def __init__ (self):
+        super().__init__ ()
+        self.data = []
+
+    async def push (self, item):
+        self.data.append (item)
+
+async def simpleServer (url, response):
+    async def f (req):
+        return web.Response (body=response, status=200, content_type='text/html', charset='utf-8')
+
+    app = web.Application ()
+    app.router.add_route ('GET', url.path, f)
+    runner = web.AppRunner(app)
+    await runner.setup()
+    site = web.TCPSite(runner, url.host, url.port)
+    await site.start()
+    return runner
+
+@pytest.mark.asyncio
+async def test_extract_links ():
+    """
+    Make sure the CSS selector exists on an example url
+    """
+
+    url = URL.build (scheme='http', host='localhost', port=8080)
+    runner = await simpleServer (url, """<html><head></head>
+            <body>
+            <div>
+                <a href="/relative">foo</a>
+                <a href="http://example.com/absolute/">foo</a>
+                <a href="https://example.com/absolute/secure">foo</a>
+                <a href="#anchor">foo</a>
+                <a href="http://neue_preise_f%c3%bcr_zahnimplantate_k%c3%b6nnten_sie_%c3%bcberraschen">foo</a>
+
+                <a href="/hidden/visibility" style="visibility: hidden">foo</a>
+                <a href="/hidden/display" style="display: none">foo</a>
+                <div style="display: none">
+                <a href="/hidden/display/insidediv">foo</a>
+                </div>
+                <!--<a href="/hidden/comment">foo</a>-->
+
+                <p><img src="shapes.png" usemap="#shapes">
+                 <map name="shapes"><area shape=rect coords="50,50,100,100" href="/map/rect"></map></p>
+            </div>
+            </body></html>""")
+
+    try:
+        handler = AccumHandler ()
+        logger = Logger ()
+        controller = SinglePageController (url=url, logger=logger,
+                service=Process (), behavior=[ExtractLinks], handler=[handler])
+        await controller.run ()
+
+        links = []
+        for d in handler.data:
+            if isinstance (d, ExtractLinksEvent):
+                links.extend (d.links)
+        assert sorted (links) == sorted ([
+                url.with_path ('/relative'),
+                url.with_fragment ('anchor'),
+                URL ('http://neue_preise_f%C3%BCr_zahnimplantate_k%C3%B6nnten_sie_%C3%BCberraschen'),
+                URL ('http://example.com/absolute/'),
+                URL ('https://example.com/absolute/secure'),
+                url.with_path ('/hidden/visibility'), # XXX: shall we ignore these as well?
+                url.with_path ('/map/rect'),
+                ])
+    finally:
+        await runner.cleanup ()
+
+@pytest.mark.asyncio
+async def test_crash ():
+    """
+    Crashing through Behavior works?
+    """
+
+    url = URL.build (scheme='http', host='localhost', port=8080)
+    runner = await simpleServer (url, '<html></html>')
+
+    try:
+        logger = Logger ()
+        controller = SinglePageController (url=url, logger=logger,
+                service=Process (), behavior=[Crash])
+        with pytest.raises (Crashed):
+            await controller.run ()
+    finally:
+        await runner.cleanup ()
+
+@pytest.mark.asyncio
+async def test_screenshot ():
+    """
+    Make sure screenshots are taken and have the correct dimensions. We can’t
+    and don’t want to check their content.
+    """
+    # ceil(0) == 0, so starting with 1
+    for expectHeight in (1, Screenshot.maxDim, Screenshot.maxDim+1, Screenshot.maxDim*2+Screenshot.maxDim//2):
+        url = URL.build (scheme='http', host='localhost', port=8080)
+        runner = await simpleServer (url, f'<html><body style="margin: 0; padding: 0;"><div style="height: {expectHeight}"></div></body></html>')
+
+        try:
+            handler = AccumHandler ()
+            logger = Logger ()
+            controller = SinglePageController (url=url, logger=logger,
+                    service=Process (), behavior=[Screenshot], handler=[handler])
+            await controller.run ()
+
+            screenshots = list (filter (lambda x: isinstance (x, ScreenshotEvent), handler.data))
+            assert len (screenshots) == math.ceil (expectHeight/Screenshot.maxDim)
+            totalHeight = 0
+            for s in screenshots:
+                assert s.url == url
+                # PNG ident is fixed, IHDR is always the first chunk
+                assert s.data.startswith (b'\x89PNG\r\n\x1a\n\x00\x00\x00\x0dIHDR')
+                width, height = struct.unpack ('>II', s.data[16:24])
+                assert height <= Screenshot.maxDim
+                totalHeight += height
+            # screenshot height is at least canvas height (XXX: get hardcoded
+            # value from devtools.Process)
+            assert totalHeight == max (expectHeight, 1080)
+        finally:
+            await runner.cleanup ()
+
+@pytest.mark.asyncio
+async def test_dom_snapshot ():
+    """
+    Behavior plug-in works, <canvas> is replaced by static image, <script> is
+    stripped. Actual conversion from Chrome DOM to HTML is validated by module
+    .test_html
+    """
+
+    url = URL.build (scheme='http', host='localhost', port=8080)
+    runner = await simpleServer (url, f'<html><body><p>ÄÖÜäöü</p><script>alert("yes");</script><canvas id="canvas" width="1" height="1">Alternate text.</canvas></body></html>')
+
+    try:
+        handler = AccumHandler ()
+        logger = Logger ()
+        controller = SinglePageController (url=url, logger=logger,
+                service=Process (), behavior=[DomSnapshot], handler=[handler])
+        await controller.run ()
+
+        snapshots = list (filter (lambda x: isinstance (x, DomSnapshotEvent), handler.data))
+        assert len (snapshots) == 1
+        doc = snapshots[0].document
+        assert doc.startswith ('<HTML><HEAD><meta charset=utf-8></HEAD><BODY><P>ÄÖÜäöü</P><IMG id=canvas width=1 height=1 src="data:image/png;base64,'.encode ('utf-8'))
+        assert doc.endswith ('></BODY></HTML>'.encode ('utf-8'))
+    finally:
+        await runner.cleanup ()
+
+def test_mapOrIgnore ():
+    def fail (x):
+        if x < 50:
+            raise Exception ()
+        return x+1
+
+    assert list (mapOrIgnore (fail, range (100))) == list (range (51, 101))
 
diff --git a/crocoite/test_browser.py b/crocoite/test_browser.py
index 06492b1..7084214 100644
--- a/crocoite/test_browser.py
+++ b/crocoite/test_browser.py
@@ -18,104 +18,30 @@
 # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 # THE SOFTWARE.
 
-import asyncio
-import pytest
+import asyncio, socket
 from operator import itemgetter
-from aiohttp import web
 from http.server import BaseHTTPRequestHandler
+from datetime import datetime
+
+from yarl import URL
+from aiohttp import web
+from multidict import CIMultiDict
 
-from .browser import Item, SiteLoader, VarChangeEvent
+from hypothesis import given
+import hypothesis.strategies as st
+from hypothesis.provisional import domains
+import pytest
+
+from .browser import RequestResponsePair, SiteLoader, Request, \
+        UnicodeBody, ReferenceTimestamp, Base64Body, UnicodeBody, Request, \
+        Response, NavigateError, PageIdle, FrameNavigated
 from .logger import Logger, Consumer
 from .devtools import Crashed, Process
 
 # if you want to know what’s going on:
+#import logging
 #logging.basicConfig(level=logging.DEBUG)
 
-class TItem (Item):
-    """ This should be as close to Item as possible """
-
-    __slots__ = ('bodySend', '_body', '_requestBody')
-    base = 'http://localhost:8000/'
-
-    def __init__ (self, path, status, headers, bodyReceive, bodySend=None, requestBody=None, failed=False, isRedirect=False):
-        super ().__init__ ()
-        self.chromeResponse = {'response': {'headers': headers, 'status': status, 'url': self.base + path}}
-        self.body = bodyReceive, False
-        self.bodySend = bodyReceive if not bodySend else bodySend
-        self.requestBody = requestBody, False
-        self.failed = failed
-        self.isRedirect = isRedirect
-
-testItems = [
-    TItem ('binary', 200, {'Content-Type': 'application/octet-stream'}, b'\x00\x01\x02', failed=True),
-    TItem ('attachment', 200, 
-            {'Content-Type': 'text/plain; charset=utf-8',
-            'Content-Disposition': 'attachment; filename="attachment.txt"',
-            },
-            'This is a simple text file with umlauts. ÄÖU.'.encode ('utf8'), failed=True),
-    TItem ('encoding/utf8', 200, {'Content-Type': 'text/plain; charset=utf-8'},
-            'This is a test, äöü μνψκ ¥¥¥¿ýý¡'.encode ('utf8')),
-    TItem ('encoding/iso88591', 200, {'Content-Type': 'text/plain; charset=ISO-8859-1'},
-            'This is a test, äöü.'.encode ('utf8'),
-            'This is a test, äöü.'.encode ('ISO-8859-1')),
-    TItem ('encoding/latin1', 200, {'Content-Type': 'text/plain; charset=latin1'},
-            'This is a test, äöü.'.encode ('utf8'),
-            'This is a test, äöü.'.encode ('latin1')),
-    TItem ('image', 200, {'Content-Type': 'image/png'},
-            # 1×1 png image
-            b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x01\x00\x00\x00\x01\x08\x00\x00\x00\x00:~\x9bU\x00\x00\x00\nIDAT\x08\x1dc\xf8\x0f\x00\x01\x01\x01\x006_g\x80\x00\x00\x00\x00IEND\xaeB`\x82'),
-    TItem ('empty', 200, {'Content-Type': 'text/plain'}, b''),
-    TItem ('headers/duplicate', 200, [('Content-Type', 'text/plain'), ('Duplicate', '1'), ('Duplicate', '2')], b''),
-    TItem ('headers/fetch/req', 200, {'Content-Type': 'text/plain'}, b''),
-    TItem ('headers/fetch/html', 200, {'Content-Type': 'text/html'},
-            r"""<html><body><script>
-            let h = new Headers([["custom", "1"]]);
-            fetch("/headers/fetch/req", {"method": "GET", "headers": h}).then(x => console.log("done"));
-            </script></body></html>""".encode ('utf8')),
-    TItem ('redirect/301/empty', 301, {'Location': '/empty'}, b'', isRedirect=True),
-    TItem ('redirect/301/redirect/301/empty', 301, {'Location': '/redirect/301/empty'}, b'', isRedirect=True),
-    TItem ('nonexistent', 404, {}, b''),
-    TItem ('html', 200, {'Content-Type': 'text/html'},
-            '<html><body><img src="/image"><img src="/nonexistent"></body></html>'.encode ('utf8')),
-    TItem ('html/alert', 200, {'Content-Type': 'text/html'},
-            '<html><body><script>window.addEventListener("beforeunload", function (e) { e.returnValue = "bye?"; return e.returnValue; }); alert("stopping here"); if (confirm("are you sure?") || prompt ("42?")) { window.location = "/nonexistent"; }</script><script>document.write(\'<img src="/image">\');</script></body></html>'.encode ('utf8')),
-    TItem ('html/fetchPost', 200, {'Content-Type': 'text/html'},
-            r"""<html><body><script>
-            let a = fetch("/html/fetchPost/binary", {"method": "POST", "body": "\x00"});
-            let b = fetch("/html/fetchPost/form", {"method": "POST", "body": new URLSearchParams({"data": "!"})});
-            let c = fetch("/html/fetchPost/binary/large", {"method": "POST", "body": "\x00".repeat(100*1024)});
-            let d = fetch("/html/fetchPost/form/large", {"method": "POST", "body": new URLSearchParams({"data": "!".repeat(100*1024)})});
-            </script></body></html>""".encode ('utf8')),
-    TItem ('html/fetchPost/binary', 200, {'Content-Type': 'application/octet-stream'}, b'\x00', requestBody=b'\x00'),
-    TItem ('html/fetchPost/form', 200, {'Content-Type': 'application/octet-stream'}, b'\x00', requestBody=b'data=%21'),
-    # XXX: these should trigger the need for getRequestPostData, but they don’t. oh well.
-    TItem ('html/fetchPost/binary/large', 200, {'Content-Type': 'application/octet-stream'}, b'\x00', requestBody=(100*1024)*b'\x00'),
-    TItem ('html/fetchPost/form/large', 200, {'Content-Type': 'application/octet-stream'}, b'\x00', requestBody=b'data=' + (100*1024)*b'%21'),
-    ]
-testItemMap = dict ([(item.parsedUrl.path, item) for item in testItems])
-
-def itemToResponse (item):
-    async def f (req):
-        headers = item.response['headers']
-        return web.Response(body=item.bodySend, status=item.response['status'],
-                headers=headers)
-    return f
-
-@pytest.fixture
-async def server ():
-    """ Simple HTTP server for testing notifications """
-    import logging
-    logging.basicConfig(level=logging.DEBUG)
-    app = web.Application(debug=True)
-    for item in testItems:
-        app.router.add_route ('*', item.parsedUrl.path, itemToResponse (item))
-    runner = web.AppRunner(app)
-    await runner.setup()
-    site = web.TCPSite(runner, 'localhost', 8080)
-    await site.start()
-    yield app
-    await runner.cleanup ()
-
 class AssertConsumer (Consumer):
     def __call__ (self, **kwargs):
         assert 'uuid' in kwargs
@@ -128,164 +54,334 @@ def logger ():
     return Logger (consumer=[AssertConsumer ()])
 
 @pytest.fixture
-async def loader (server, logger):
-    def f (path):
-        if path.startswith ('/'):
-            path = 'http://localhost:8080{}'.format (path)
-        return SiteLoader (browser, path, logger)
-    async with Process () as browser:
-        yield f
-
-async def itemsLoaded (l, items):
-    items = dict ([(i.parsedUrl.path, i) for i in items])
-    async for item in l:
-        assert item.chromeResponse is not None
-        golden = items.pop (item.parsedUrl.path)
-        if not golden:
-            assert False, 'url {} not supposed to be fetched'.format (item.url)
-        assert item.failed == golden.failed
-        if item.failed:
-            # response will be invalid if request failed
-            if not items:
-                break
-            else:
-                continue
-        assert item.isRedirect == golden.isRedirect
-        if golden.isRedirect:
-            assert item.body is None
-        else:
-            assert item.body[0] == golden.body[0]
-        assert item.requestBody[0] == golden.requestBody[0]
-        assert item.response['status'] == golden.response['status']
-        assert item.statusText == BaseHTTPRequestHandler.responses.get (item.response['status'])[0]
-        for k, v in golden.responseHeaders:
-            actual = list (map (itemgetter (1), filter (lambda x: x[0] == k, item.responseHeaders)))
-            assert v in actual
-        
-        # we’re done when everything has been loaded
-        if not items:
-            break
-
-async def literalItem (lf, item, deps=[]):
-    async with lf (item.parsedUrl.path) as l:
-        await l.start ()
-        await asyncio.wait_for (itemsLoaded (l, [item] + deps), timeout=30)
+async def loader (logger):
+    async with Process () as browser, SiteLoader (browser, logger) as l:
+        yield l
 
 @pytest.mark.asyncio
-async def test_empty (loader):
-    await literalItem (loader, testItemMap['/empty'])
+async def test_crash (loader):
+    with pytest.raises (Crashed):
+        await loader.tab.Page.crash ()
 
 @pytest.mark.asyncio
-async def test_headers_duplicate (loader):
-    """
-    Some headers, like Set-Cookie can be present multiple times. Chrome
-    separates these with a newline.
-    """
-    async with loader ('/headers/duplicate') as l:
-        await l.start ()
-        async for it in l:
-            if it.parsedUrl.path == '/headers/duplicate':
-                assert not it.failed
-                dup = list (filter (lambda x: x[0] == 'Duplicate', it.responseHeaders))
-                assert len(dup) == 2
-                assert list(sorted(map(itemgetter(1), dup))) == ['1', '2']
-                break
+async def test_invalidurl (loader):
+    host = 'nonexistent.example'
 
-@pytest.mark.asyncio
-async def test_headers_req (loader):
-    """
-    Custom request headers. JavaScript’s Headers() does not support duplicate
-    headers, so we can’t generate those.
-    """
-    async with loader ('/headers/fetch/html') as l:
-        await l.start ()
-        async for it in l:
-            if it.parsedUrl.path == '/headers/fetch/req':
-                assert not it.failed
-                dup = list (filter (lambda x: x[0] == 'custom', it.requestHeaders))
-                assert len(dup) == 1
-                assert list(sorted(map(itemgetter(1), dup))) == ['1']
-                break
+    # make sure the url does *not* resolve (some DNS intercepting ISP’s mess
+    # with this)
+    loop = asyncio.get_event_loop ()
+    try:
+        resolved = await loop.getaddrinfo (host, None)
+    except socket.gaierror:
+        url = URL.build (scheme='http', host=host)
+        with pytest.raises (NavigateError):
+            await loader.navigate (url)
+    else:
+        pytest.skip (f'host {host} resolved to {resolved}')
 
-@pytest.mark.asyncio
-async def test_redirect (loader):
-    await literalItem (loader, testItemMap['/redirect/301/empty'], [testItemMap['/empty']])
-    # chained redirects
-    await literalItem (loader, testItemMap['/redirect/301/redirect/301/empty'], [testItemMap['/redirect/301/empty'], testItemMap['/empty']])
+timestamp = st.one_of (
+                st.integers(min_value=0, max_value=2**32-1),
+                st.floats (min_value=0, max_value=2**32-1),
+                )
 
-@pytest.mark.asyncio
-async def test_encoding (loader):
-    """ Text responses are transformed to UTF-8. Make sure this works
-    correctly. """
-    for item in {testItemMap['/encoding/utf8'], testItemMap['/encoding/latin1'], testItemMap['/encoding/iso88591']}:
-        await literalItem (loader, item)
+@given(timestamp, timestamp, timestamp)
+def test_referencetimestamp (relativeA, absoluteA, relativeB):
+    ts = ReferenceTimestamp (relativeA, absoluteA)
+    absoluteA = datetime.utcfromtimestamp (absoluteA)
+    absoluteB = ts (relativeB)
+    assert (absoluteA < absoluteB and relativeA < relativeB) or \
+            (absoluteA >= absoluteB and relativeA >= relativeB)
+    assert abs ((absoluteB - absoluteA).total_seconds () - (relativeB - relativeA)) < 10e-6
 
-@pytest.mark.asyncio
-async def test_binary (loader):
-    """ Browser should ignore content it cannot display (i.e. octet-stream) """
-    await literalItem (loader, testItemMap['/binary'])
+def urls ():
+    """ Build http/https URL """
+    scheme = st.sampled_from (['http', 'https'])
+    # Path must start with a slash
+    pathSt = st.builds (lambda x: '/' + x, st.text ())
+    args = st.fixed_dictionaries ({
+            'scheme': scheme,
+            'host': domains (),
+            'port': st.one_of (st.none (), st.integers (min_value=1, max_value=2**16-1)),
+            'path': pathSt,
+            'query_string': st.text (),
+            'fragment': st.text (),
+            })
+    return st.builds (lambda x: URL.build (**x), args)
 
-@pytest.mark.asyncio
-async def test_image (loader):
-    """ Images should be displayed inline """
-    await literalItem (loader, testItemMap['/image'])
+def urlsStr ():
+    return st.builds (lambda x: str (x), urls ())
 
-@pytest.mark.asyncio
-async def test_attachment (loader):
-    """ And downloads won’t work in headless mode, even if it’s just a text file """
-    await literalItem (loader, testItemMap['/attachment'])
+asciiText = st.text (st.characters (min_codepoint=32, max_codepoint=126))
 
-@pytest.mark.asyncio
-async def test_html (loader):
-    await literalItem (loader, testItemMap['/html'], [testItemMap['/image'], testItemMap['/nonexistent']])
-    # make sure alerts are dismissed correctly (image won’t load otherwise)
-    await literalItem (loader, testItemMap['/html/alert'], [testItemMap['/image']])
+def chromeHeaders ():
+    # token as defined by https://tools.ietf.org/html/rfc7230#section-3.2.6
+    token = st.sampled_from('abcdefghijklmnopqrstuvwxyz0123456789!#$%&\'*+-.^_`|~')
+    # XXX: the value should be asciiText without leading/trailing spaces
+    return st.dictionaries (token, token)
 
-@pytest.mark.asyncio
-async def test_post (loader):
-    """ XHR POST request with binary data"""
-    await literalItem (loader, testItemMap['/html/fetchPost'],
-            [testItemMap['/html/fetchPost/binary'],
-            testItemMap['/html/fetchPost/binary/large'],
-            testItemMap['/html/fetchPost/form'],
-            testItemMap['/html/fetchPost/form/large']])
+def fixedDicts (fixed, dynamic):
+    return st.builds (lambda x, y: x.update (y), st.fixed_dictionaries (fixed), st.lists (dynamic))
 
-@pytest.mark.asyncio
-async def test_crash (loader):
-    async with loader ('/html') as l:
-        await l.start ()
-        with pytest.raises (Crashed):
-            await l.tab.Page.crash ()
+def chromeRequestWillBeSent (reqid, url):
+    methodSt = st.sampled_from (['GET', 'POST', 'PUT', 'DELETE'])
+    return st.fixed_dictionaries ({
+            'requestId': reqid,
+            'initiator': st.just ('Test'),
+            'wallTime': timestamp,
+            'timestamp': timestamp,
+            'request': st.fixed_dictionaries ({
+                'url': url,
+                'method': methodSt,
+                'headers': chromeHeaders (),
+                # XXX: postData, hasPostData
+                })
+            })
 
-@pytest.mark.asyncio
-async def test_invalidurl (loader):
-    url = 'http://nonexistent.example/'
-    async with loader (url) as l:
-        await l.start ()
-        async for it in l:
-            assert it.failed
-            break
+def chromeResponseReceived (reqid, url):
+    mimeTypeSt = st.one_of (st.none (), st.just ('text/html'))
+    remoteIpAddressSt = st.one_of (st.none (), st.just ('127.0.0.1'))
+    protocolSt = st.one_of (st.none (), st.just ('h2'))
+    statusCodeSt = st.integers (min_value=100, max_value=999)
+    typeSt = st.sampled_from (['Document', 'Stylesheet', 'Image', 'Media',
+            'Font', 'Script', 'TextTrack', 'XHR', 'Fetch', 'EventSource',
+            'WebSocket', 'Manifest', 'SignedExchange', 'Ping',
+            'CSPViolationReport', 'Other'])
+    return st.fixed_dictionaries ({
+            'requestId': reqid,
+            'timestamp': timestamp,
+            'type': typeSt,
+            'response': st.fixed_dictionaries ({
+                'url': url,
+                'requestHeaders': chromeHeaders (), # XXX: make this optional
+                'headers': chromeHeaders (),
+                'status': statusCodeSt,
+                'statusText': asciiText,
+                'mimeType': mimeTypeSt,
+                'remoteIPAddress': remoteIpAddressSt,
+                'protocol': protocolSt,
+                })
+            })
+
+def chromeReqResp ():
+    # XXX: will this gnerated the same url for all testcases?
+    reqid = st.shared (st.text (), 'reqresp')
+    url = st.shared (urlsStr (), 'reqresp')
+    return st.tuples (chromeRequestWillBeSent (reqid, url),
+            chromeResponseReceived (reqid, url))
+
+def requestResponsePair ():
+    def f (creq, cresp, hasPostData, reqBody, respBody):
+        i = RequestResponsePair ()
+        i.fromRequestWillBeSent (creq)
+        i.request.hasPostData = hasPostData
+        if hasPostData:
+            i.request.body = reqBody
+
+        if cresp is not None:
+            i.fromResponseReceived (cresp)
+            if respBody is not None:
+                i.response.body = respBody
+        return i
+
+    bodySt = st.one_of (
+            st.none (),
+            st.builds (UnicodeBody, st.text ()),
+            st.builds (Base64Body.fromBytes, st.binary ())
+            )
+    return st.builds (lambda reqresp, hasPostData, reqBody, respBody:
+            f (reqresp[0], reqresp[1], hasPostData, reqBody, respBody),
+            chromeReqResp (), st.booleans (), bodySt, bodySt)
+
+@given(chromeReqResp ())
+def test_requestResponsePair (creqresp):
+    creq, cresp = creqresp
+
+    item = RequestResponsePair ()
+
+    assert item.id is None
+    assert item.url is None
+    assert item.request is None
+    assert item.response is None
+
+    item.fromRequestWillBeSent (creq)
+
+    assert item.id == creq['requestId']
+    url = URL (creq['request']['url'])
+    assert item.url == url
+    assert item.request is not None
+    assert item.request.timestamp == datetime.utcfromtimestamp (creq['wallTime'])
+    assert set (item.request.headers.keys ()) == set (creq['request']['headers'].keys ())
+    assert item.response is None
+
+    item.fromResponseReceived (cresp)
+
+    # url will not be overwritten
+    assert item.id == creq['requestId'] == cresp['requestId']
+    assert item.url == url
+    assert item.request is not None
+    assert set (item.request.headers.keys ()) == set (cresp['response']['requestHeaders'].keys ())
+    assert item.response is not None
+    assert set (item.response.headers.keys ()) == set (cresp['response']['headers'].keys ())
+    assert (item.response.timestamp - item.request.timestamp).total_seconds () - \
+            (cresp['timestamp'] - creq['timestamp']) < 10e-6
+
+@given(chromeReqResp ())
+def test_requestResponsePair_eq (creqresp):
+    creq, cresp = creqresp
+
+    item = RequestResponsePair ()
+    item2 = RequestResponsePair ()
+    assert item == item
+    assert item == item2
+
+    item.fromRequestWillBeSent (creq)
+    assert item != item2
+    item2.fromRequestWillBeSent (creq)
+    assert item == item
+    assert item == item2
+
+    item.fromResponseReceived (cresp)
+    assert item != item2
+    item2.fromResponseReceived (cresp)
+    assert item == item
+    assert item == item2
+
+    # XXX: test for inequality with different parameters
+
+### Google Chrome integration tests ###
+
+serverUrl = URL.build (scheme='http', host='localhost', port=8080)
+items = [
+    RequestResponsePair (
+        url=serverUrl.with_path ('/encoding/utf-8'),
+        request=Request (method='GET'),
+        response=Response (status=200, headers=CIMultiDict ([('Content-Type', 'text/html; charset=utf-8')]),
+            body=UnicodeBody ('äöü'), mimeType='text/html')
+        ),
+    RequestResponsePair (
+        url=serverUrl.with_path ('/encoding/latin1'),
+        request=Request (method='GET'),
+        response=Response (status=200, headers=CIMultiDict ([('Content-Type', 'text/html; charset=latin1')]),
+            body=UnicodeBody ('äöü'), mimeType='text/html')
+        ),
+    RequestResponsePair (
+        url=serverUrl.with_path ('/encoding/utf-16'),
+        request=Request (method='GET'),
+        response=Response (status=200, headers=CIMultiDict ([('Content-Type', 'text/html; charset=utf-16')]),
+            body=UnicodeBody ('äöü'), mimeType='text/html')
+        ),
+    RequestResponsePair (
+        url=serverUrl.with_path ('/encoding/ISO-8859-1'),
+        request=Request (method='GET'),
+        response=Response (status=200, headers=CIMultiDict ([('Content-Type', 'text/html; charset=ISO-8859-1')]),
+            body=UnicodeBody ('äöü'), mimeType='text/html')
+        ),
+    RequestResponsePair (
+        url=serverUrl.with_path ('/status/200'),
+        request=Request (method='GET'),
+        response=Response (status=200, headers=CIMultiDict ([('Content-Type', 'text/plain')]),
+            body=b'',
+            mimeType='text/plain'),
+        ),
+    # redirects never have a response body
+    RequestResponsePair (
+        url=serverUrl.with_path ('/status/301'),
+        request=Request (method='GET'),
+        response=Response (status=301,
+            headers=CIMultiDict ([('Content-Type', 'text/plain'),
+                ('Location', str (serverUrl.with_path ('/status/301/redirected')))]),
+            body=None,
+            mimeType='text/plain'),
+        ),
+    RequestResponsePair (
+        url=serverUrl.with_path ('/image/png'),
+        request=Request (method='GET'),
+        response=Response (status=200, headers=CIMultiDict ([('Content-Type', 'image/png')]),
+            body=Base64Body.fromBytes (b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x01\x00\x00\x00\x01\x08\x00\x00\x00\x00:~\x9bU\x00\x00\x00\nIDAT\x08\x1dc\xf8\x0f\x00\x01\x01\x01\x006_g\x80\x00\x00\x00\x00IEND\xaeB`\x82'),
+            mimeType='image/png'),
+        ),
+    RequestResponsePair (
+        url=serverUrl.with_path ('/script/alert'),
+        request=Request (method='GET'),
+        response=Response (status=200, headers=CIMultiDict ([('Content-Type', 'text/html; charset=utf-8')]),
+            body=UnicodeBody ('''<html><body><script>
+window.addEventListener("beforeunload", function (e) {
+    e.returnValue = "bye?";
+    return e.returnValue;
+});
+alert("stopping here");
+if (confirm("are you sure?") || prompt ("42?")) {
+    window.location = "/nonexistent";
+}
+</script></body></html>'''), mimeType='text/html')
+        ),
+    ]
 
 @pytest.mark.asyncio
-async def test_varchangeevent ():
-    e = VarChangeEvent (True)
-    assert e.get () == True
-
-    # no change at all
-    w = asyncio.ensure_future (e.wait ())
-    finished, pending = await asyncio.wait ([w], timeout=0.1)
-    assert not finished and pending
-
-    # no change
-    e.set (True)
-    finished, pending = await asyncio.wait ([w], timeout=0.1)
-    assert not finished and pending
-
-    # changed
-    e.set (False)
-    await asyncio.sleep (0.1) # XXX: is there a yield() ?
-    assert w.done ()
-    ret = w.result ()
-    assert ret == False
-    assert e.get () == ret
+# would be nice if we could use hypothesis here somehow
+@pytest.mark.parametrize("golden", items)
+async def test_integration_item (loader, golden):
+    async def f (req):
+        body = golden.response.body
+        contentType = golden.response.headers.get ('content-type', '') if golden.response.headers is not None else ''
+        charsetOff = contentType.find ('charset=')
+        if isinstance (body, UnicodeBody) and charsetOff != -1:
+            encoding = contentType[charsetOff+len ('charset='):]
+            body = golden.response.body.decode ('utf-8').encode (encoding)
+        return web.Response (body=body, status=golden.response.status,
+                headers=golden.response.headers)
+
+    app = web.Application ()
+    app.router.add_route (golden.request.method, golden.url.path, f)
+    runner = web.AppRunner(app)
+    await runner.setup()
+    site = web.TCPSite(runner, serverUrl.host, serverUrl.port)
+    try:
+        await site.start()
+    except Exception as e:
+        pytest.skip (e)
+
+    haveReqResp = False
+    haveNavigated = False
+    try:
+        await loader.navigate (golden.url)
+
+        it = loader.__aiter__ ()
+        while True:
+            try:
+                item = await asyncio.wait_for (it.__anext__ (), timeout=1)
+            except asyncio.TimeoutError:
+                break
+            # XXX: can only check the first req/resp right now (due to redirect)
+            if isinstance (item, RequestResponsePair) and not haveReqResp:
+                # we do not know this in advance
+                item.request.initiator = None
+                item.request.headers = None
+                item.remoteIpAddress = None
+                item.protocol = None
+                item.resourceType = None
+
+                if item.response:
+                    assert item.response.statusText is not None
+                    item.response.statusText = None
+
+                    del item.response.headers['server']
+                    del item.response.headers['content-length']
+                    del item.response.headers['date']
+                assert item == golden
+                haveReqResp = True
+            elif isinstance (item, FrameNavigated):
+                # XXX: can’t check this, because of the redirect
+                #assert item.url == golden.url
+                haveNavigated = True
+    finally:
+        assert haveReqResp
+        assert haveNavigated
+        await runner.cleanup ()
+
+def test_page_idle ():
+    for v in (True, False):
+        idle = PageIdle (v)
+        assert bool (idle) == v
+
 
diff --git a/crocoite/test_controller.py b/crocoite/test_controller.py
new file mode 100644
index 0000000..7216a42
--- /dev/null
+++ b/crocoite/test_controller.py
@@ -0,0 +1,203 @@
+# Copyright (c) 2017–2018 crocoite contributors
+# 
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+# 
+# The above copyright notice and this permission notice shall be included in
+# all copies or substantial portions of the Software.
+# 
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+# THE SOFTWARE.
+
+import asyncio
+
+from yarl import URL
+from aiohttp import web
+
+import pytest
+
+from .logger import Logger
+from .controller import ControllerSettings, SinglePageController, SetEntry, \
+        IdleStateTracker
+from .browser import PageIdle
+from .devtools import Process
+from .test_browser import loader
+
+@pytest.mark.asyncio
+async def test_controller_timeout ():
+    """ Make sure the controller terminates, even if the site keeps reloading/fetching stuff """
+
+    async def f (req):
+        return web.Response (body="""<html>
+<body>
+<p>hello</p>
+<script>
+window.setTimeout (function () { window.location = '/' }, 250);
+window.setInterval (function () { fetch('/').then (function (e) { console.log (e) }) }, 150);
+</script>
+</body>
+</html>""", status=200, content_type='text/html', charset='utf-8')
+
+    url = URL.build (scheme='http', host='localhost', port=8080)
+    app = web.Application ()
+    app.router.add_route ('GET', '/', f)
+    runner = web.AppRunner(app)
+    await runner.setup()
+    site = web.TCPSite(runner, url.host, url.port)
+    await site.start()
+
+    loop = asyncio.get_event_loop ()
+    try:
+        logger = Logger ()
+        settings = ControllerSettings (idleTimeout=1, timeout=5)
+        controller = SinglePageController (url=url, logger=logger,
+                service=Process (), behavior=[], settings=settings)
+        # give the controller a little more time to finish, since there are
+        # hard-coded asyncio.sleep calls in there right now.
+        # XXX fix this
+        before = loop.time ()
+        await asyncio.wait_for (controller.run (), timeout=settings.timeout*2)
+        after = loop.time ()
+        assert after-before >= settings.timeout, (settings.timeout*2, after-before)
+    finally:
+        # give the browser some time to close before interrupting the
+        # connection by destroying the HTTP server
+        await asyncio.sleep (1)
+        await runner.cleanup ()
+
+@pytest.mark.asyncio
+async def test_controller_idle_timeout ():
+    """ Make sure the controller terminates, even if the site keeps reloading/fetching stuff """
+
+    async def f (req):
+        return web.Response (body="""<html>
+<body>
+<p>hello</p>
+<script>
+window.setInterval (function () { fetch('/').then (function (e) { console.log (e) }) }, 2000);
+</script>
+</body>
+</html>""", status=200, content_type='text/html', charset='utf-8')
+
+    url = URL.build (scheme='http', host='localhost', port=8080)
+    app = web.Application ()
+    app.router.add_route ('GET', '/', f)
+    runner = web.AppRunner(app)
+    await runner.setup()
+    site = web.TCPSite(runner, url.host, url.port)
+    await site.start()
+
+    loop = asyncio.get_event_loop ()
+    try:
+        logger = Logger ()
+        settings = ControllerSettings (idleTimeout=1, timeout=60)
+        controller = SinglePageController (url=url, logger=logger,
+                service=Process (), behavior=[], settings=settings)
+        before = loop.time ()
+        await asyncio.wait_for (controller.run (), settings.timeout*2)
+        after = loop.time ()
+        assert settings.idleTimeout <= after-before <= settings.idleTimeout*2+3
+    finally:
+        await runner.cleanup ()
+
+def test_set_entry ():
+    a = SetEntry (1, a=2, b=3)
+    assert a == a
+    assert hash (a) == hash (a)
+
+    b = SetEntry (1, a=2, b=4)
+    assert a == b
+    assert hash (a) == hash (b)
+
+    c = SetEntry (2, a=2, b=3)
+    assert a != c
+    assert hash (a) != hash (c)
+
+@pytest.mark.asyncio
+async def test_idle_state_tracker ():
+    # default is idle
+    loop = asyncio.get_event_loop ()
+    idle = IdleStateTracker (loop)
+    assert idle._idle
+
+    # idle change
+    await idle.push (PageIdle (False))
+    assert not idle._idle
+
+    # nothing happens for other objects
+    await idle.push ({})
+    assert not idle._idle
+
+    # no state change -> wait does not return
+    with pytest.raises (asyncio.TimeoutError):
+        await asyncio.wait_for (idle.wait (0.1), timeout=1)
+
+    # wait at least timeout
+    delta = 0.2
+    timeout = 1
+    await idle.push (PageIdle (True))
+    assert idle._idle
+    start = loop.time ()
+    await idle.wait (timeout)
+    end = loop.time ()
+    assert (timeout-delta) < (end-start) < (timeout+delta)
+
+@pytest.fixture
+async def recordingServer ():
+    """ Simple HTTP server that records raw requests """
+    url = URL ('http://localhost:8080')
+    reqs = []
+    async def record (request):
+        reqs.append (request)
+        return web.Response(text='ok', content_type='text/plain')
+    app = web.Application()
+    app.add_routes([web.get(url.path, record)])
+    runner = web.AppRunner(app)
+    await runner.setup()
+    site = web.TCPSite (runner, url.host, url.port)
+    await site.start()
+    yield url, reqs
+    await runner.cleanup ()
+
+from .test_devtools import tab, browser
+from http.cookies import Morsel, SimpleCookie
+
+@pytest.mark.asyncio
+async def test_set_cookies (tab, recordingServer):
+    """ Make sure cookies are set properly and only affect the domain they were
+    set for """
+
+    logger = Logger ()
+
+    url, reqs = recordingServer
+
+    cookies = []
+    c = Morsel ()
+    c.set ('foo', 'bar', '')
+    c['domain'] = 'localhost'
+    cookies.append (c)
+    c = Morsel ()
+    c.set ('buz', 'beef', '')
+    c['domain'] = 'nonexistent.example'
+
+    settings = ControllerSettings (idleTimeout=1, timeout=60, cookies=cookies)
+    controller = SinglePageController (url=url, logger=logger,
+            service=Process (), behavior=[], settings=settings)
+    await asyncio.wait_for (controller.run (), settings.timeout*2)
+    
+    assert len (reqs) == 1
+    req = reqs[0]
+    reqCookies = SimpleCookie (req.headers['cookie'])
+    assert len (reqCookies) == 1
+    c = next (iter (reqCookies.values ()))
+    assert c.key == cookies[0].key
+    assert c.value == cookies[0].value
diff --git a/crocoite/test_devtools.py b/crocoite/test_devtools.py
index 74d223f..bd1a828 100644
--- a/crocoite/test_devtools.py
+++ b/crocoite/test_devtools.py
@@ -24,7 +24,8 @@ import pytest
 from aiohttp import web
 import websockets
 
-from .devtools import Browser, Tab, MethodNotFound, Crashed, InvalidParameter, Process, Passthrough
+from .devtools import Browser, Tab, MethodNotFound, Crashed, \
+        InvalidParameter, Process, Passthrough
 
 @pytest.fixture
 async def browser ():
@@ -38,8 +39,9 @@ async def tab (browser):
         # make sure there are no transactions left over (i.e. no unawaited requests)
         assert not tab.transactions
 
+docBody = "<html><body><p>Hello, world</p></body></html>"
 async def hello(request):
-    return web.Response(text="Hello, world")
+    return web.Response(text=docBody, content_type='text/html')
 
 @pytest.fixture
 async def server ():
@@ -73,8 +75,10 @@ async def test_tab_close (browser):
 
 @pytest.mark.asyncio
 async def test_tab_notify_enable_disable (tab):
-    """ Make sure enabling/disabling notifications works for all known namespaces """
-    for name in ('Debugger', 'DOM', 'Log', 'Network', 'Page', 'Performance', 'Profiler', 'Runtime', 'Security'):
+    """ Make sure enabling/disabling notifications works for all known
+    namespaces """
+    for name in ('Debugger', 'DOM', 'Log', 'Network', 'Page', 'Performance',
+            'Profiler', 'Runtime', 'Security'):
         f = getattr (tab, name)
         await f.enable ()
         await f.disable ()
@@ -109,14 +113,45 @@ async def test_tab_crash (tab):
 async def test_load (tab, server):
     await tab.Network.enable ()
     await tab.Page.navigate (url='http://localhost:8080')
-    method, req = await tab.get ()
-    assert method == tab.Network.requestWillBeSent
-    method, resp = await tab.get ()
-    assert method == tab.Network.responseReceived
-    assert tab.pending == 0
-    body = await tab.Network.getResponseBody (requestId=req['requestId'])
-    assert body['body'] == "Hello, world"
+
+    haveRequest = False
+    haveResponse = False
+    haveData = False
+    haveFinished = False
+    haveBody = False
+    req = None
+    resp = None
+    while not haveBody:
+        method, data = await tab.get ()
+
+        # it can be either of those two in no specified order
+        if method in (tab.Network.requestWillBeSent, tab.Network.requestWillBeSentExtraInfo) and not haveResponse:
+            if req is None:
+                req = data
+            assert data['requestId'] == req['requestId']
+            haveRequest = True
+        elif method in (tab.Network.responseReceived, tab.Network.responseReceivedExtraInfo) and haveRequest:
+            if resp is None:
+                resp = data
+            assert data['requestId'] == resp['requestId']
+            haveResponse = True
+        elif haveRequest and haveResponse and method == tab.Network.dataReceived:
+            assert data['dataLength'] == len (docBody)
+            assert data['requestId'] == req['requestId']
+            haveData = True
+        elif haveData:
+            assert method == tab.Network.loadingFinished
+            assert data['requestId'] == req['requestId']
+            haveBody = True
+        elif haveFinished:
+            body = await tab.Network.getResponseBody (requestId=req['requestId'])
+            assert body['body'] == docBody
+            haveBody = True
+        else:
+            assert False, (method, req)
+
     await tab.Network.disable ()
+    assert tab.pending == 0
 
 @pytest.mark.asyncio
 async def test_recv_failure(browser):
@@ -149,7 +184,8 @@ async def test_tab_function (tab):
 
 @pytest.mark.asyncio
 async def test_tab_function_hash (tab):
-    d = {tab.Network.enable: 1, tab.Network.disable: 2, tab.Page: 3, tab.Page.enable: 4}
+    d = {tab.Network.enable: 1, tab.Network.disable: 2, tab.Page: 3,
+            tab.Page.enable: 4}
     assert len (d) == 4
 
 @pytest.mark.asyncio
@@ -168,5 +204,5 @@ async def test_passthrough ():
 
     url = 'http://localhost:12345'
     async with Passthrough (url) as u:
-        assert u == url
+        assert str (u) == url
 
diff --git a/crocoite/test_html.py b/crocoite/test_html.py
index c71697a..c17903b 100644
--- a/crocoite/test_html.py
+++ b/crocoite/test_html.py
@@ -18,9 +18,11 @@
 # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 # THE SOFTWARE.
 
+import asyncio
 import pytest, html5lib
 from html5lib.serializer import HTMLSerializer
 from html5lib.treewalkers import getTreeWalker
+from aiohttp import web
 
 from .html import StripTagFilter, StripAttributeFilter, ChromeTreeWalker
 from .test_devtools import tab, browser
@@ -58,3 +60,37 @@ async def test_treewalker (tab):
         elif i == 1:
             assert result == framehtml
 
+cdataDoc = '<test><![CDATA[Hello world]]></test>'
+xmlHeader = '<?xml version="1.0" encoding="UTF-8"?>'
+async def hello(request):
+    return web.Response(text=xmlHeader + cdataDoc, content_type='text/xml')
+
+@pytest.fixture
+async def server ():
+    """ Simple HTTP server for testing notifications """
+    app = web.Application()
+    app.add_routes([web.get('/test.xml', hello)])
+    runner = web.AppRunner(app)
+    await runner.setup()
+    site = web.TCPSite(runner, 'localhost', 8080)
+    await site.start()
+    yield app
+    await runner.cleanup ()
+
+@pytest.mark.asyncio
+async def test_treewalker_cdata (tab, server):
+    ret = await tab.Page.navigate (url='http://localhost:8080/test.xml')
+    # wait until loaded XXX: replace with idle check
+    await asyncio.sleep (0.5)
+    dom = await tab.DOM.getDocument (depth=-1, pierce=True)
+    docs = list (ChromeTreeWalker (dom['root']).split ())
+    assert len(docs) == 1
+    for i, doc in enumerate (docs):
+        walker = ChromeTreeWalker (doc)
+        serializer = HTMLSerializer ()
+        result = serializer.render (iter(walker))
+        # chrome will display a pretty-printed viewer *plus* the original
+        # source (stripped of its xml header)
+        assert cdataDoc in result
+
+
diff --git a/crocoite/test_irc.py b/crocoite/test_irc.py
index 4d80a6d..9344de4 100644
--- a/crocoite/test_irc.py
+++ b/crocoite/test_irc.py
@@ -19,7 +19,7 @@
 # THE SOFTWARE.
 
 import pytest
-from .irc import ArgparseBot, RefCountEvent
+from .irc import ArgparseBot, RefCountEvent, User, NickMode
 
 def test_mode_parse ():
     assert ArgparseBot.parseMode ('+a') == [('+', 'a')]
@@ -51,3 +51,20 @@ def test_refcountevent_arm_with (event):
         event.arm ()
         assert not event.event.is_set ()
     assert event.event.is_set ()
+
+def test_nick_mode ():
+    a = User.fromName ('a')
+    a2 = User.fromName ('a')
+    a3 = User.fromName ('+a')
+    b = User.fromName ('+b')
+    c = User.fromName ('@c')
+
+    # equality is based on name only, not mode
+    assert a == a2
+    assert a == a3
+    assert a != b
+
+    assert a.hasPriv (None) and not a.hasPriv (NickMode.voice) and not a.hasPriv (NickMode.operator)
+    assert b.hasPriv (None) and b.hasPriv (NickMode.voice) and not b.hasPriv (NickMode.operator)
+    assert c.hasPriv (None) and c.hasPriv (NickMode.voice) and c.hasPriv (NickMode.operator)
+
diff --git a/crocoite/test_logger.py b/crocoite/test_logger.py
index 3af1321..26e420a 100644
--- a/crocoite/test_logger.py
+++ b/crocoite/test_logger.py
@@ -80,3 +80,12 @@ def test_datetime (logger):
     ret = logger.debug()
     assert 'date' in ret
 
+def test_independence ():
+    """ Make sure two instances are completely independent """
+    l1 = Logger ()
+    c = QueueConsumer ()
+    l1.connect (c)
+    l2 = Logger ()
+    l2.info (nothing='nothing')
+    assert not c.data
+
diff --git a/crocoite/test_tools.py b/crocoite/test_tools.py
index 947d020..416b954 100644
--- a/crocoite/test_tools.py
+++ b/crocoite/test_tools.py
@@ -25,9 +25,9 @@ import pytest
 from warcio.archiveiterator import ArchiveIterator
 from warcio.warcwriter import WARCWriter
 from warcio.statusandheaders import StatusAndHeaders
+from pkg_resources import parse_version
 
-from .tools import mergeWarc
-from .util import packageUrl
+from .tools import mergeWarc, Errata, FixableErrata
 
 @pytest.fixture
 def writer():
@@ -48,9 +48,11 @@ def recordsEqual(golden, underTest):
 
 def makeGolden(writer, records):
     # additional warcinfo is written. Content does not matter.
-    record = writer.create_warc_record (packageUrl ('warcinfo'), 'warcinfo',
+    record = writer.create_warc_record (
+            '',
+            'warcinfo',
             payload=b'',
-            warc_headers_dict={'Content-Type': 'text/plain; encoding=utf-8'})
+            warc_headers_dict={'Content-Type': 'application/json; charset=utf-8'})
     records.insert (0, record)
     return records
 
@@ -96,7 +98,7 @@ def test_different_payload(writer):
 
         httpHeaders = StatusAndHeaders('200 OK', {}, protocol='HTTP/1.1')
         record = writer.create_warc_record ('http://example.com/', 'response',
-                payload=BytesIO('data{}'.format(i).encode ('utf8')),
+                payload=BytesIO(f'data{i}'.encode ('utf8')),
                 warc_headers_dict=warcHeaders, http_headers=httpHeaders)
         records.append (record)
 
@@ -195,3 +197,28 @@ def test_resp_revisit_other_url(writer):
     output.seek(0)
     recordsEqual (makeGolden (writer, records), ArchiveIterator (output))
 
+def test_errata_contains():
+    """ Test version matching """
+    e = Errata('some-uuid', 'description', ['a<1.0'])
+    assert {'a': parse_version('0.1')} in e
+    assert {'a': parse_version('1.0')} not in e
+    assert {'b': parse_version('1.0')} not in e
+
+    e = Errata('some-uuid', 'description', ['a<1.0,>0.1'])
+    assert {'a': parse_version('0.1')} not in e
+    assert {'a': parse_version('0.2')} in e
+    assert {'a': parse_version('1.0')} not in e
+
+    # a AND b
+    e = Errata('some-uuid', 'description', ['a<1.0', 'b>1.0'])
+    assert {'a': parse_version('0.1')} not in e
+    assert {'b': parse_version('1.1')} not in e
+    assert {'a': parse_version('0.1'), 'b': parse_version('1.1')} in e
+
+def test_errata_fixable ():
+    e = Errata('some-uuid', 'description', ['a<1.0', 'b>1.0'])
+    assert not e.fixable
+
+    e = FixableErrata('some-uuid', 'description', ['a<1.0', 'b>1.0'])
+    assert e.fixable
+
diff --git a/crocoite/test_warc.py b/crocoite/test_warc.py
new file mode 100644
index 0000000..3ec310c
--- /dev/null
+++ b/crocoite/test_warc.py
@@ -0,0 +1,225 @@
+# Copyright (c) 2018 crocoite contributors
+# 
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+# 
+# The above copyright notice and this permission notice shall be included in
+# all copies or substantial portions of the Software.
+# 
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+# THE SOFTWARE.
+
+from tempfile import NamedTemporaryFile
+import json, urllib
+from operator import itemgetter
+
+from warcio.archiveiterator import ArchiveIterator
+from yarl import URL
+from multidict import CIMultiDict
+from hypothesis import given, reproduce_failure
+import hypothesis.strategies as st
+import pytest
+
+from .warc import WarcHandler
+from .logger import Logger, WarcHandlerConsumer
+from .controller import ControllerStart
+from .behavior import Script, ScreenshotEvent, DomSnapshotEvent
+from .browser import RequestResponsePair, Base64Body, UnicodeBody
+from .test_browser import requestResponsePair, urls
+
+def test_log ():
+    logger = Logger ()
+
+    with NamedTemporaryFile() as fd:
+        with WarcHandler (fd, logger) as handler:
+            warclogger = WarcHandlerConsumer (handler)
+            logger.connect (warclogger)
+            golden = []
+
+            assert handler.log.tell () == 0
+            golden.append (logger.info (foo=1, bar='baz', encoding='äöü⇔ΓΨ'))
+            assert handler.log.tell () != 0
+
+            handler.maxLogSize = 0
+            golden.append (logger.info (bar=1, baz='baz'))
+            # should flush the log
+            assert handler.log.tell () == 0
+
+        fd.seek (0)
+        for it in ArchiveIterator (fd):
+            headers = it.rec_headers
+            assert headers['warc-type'] == 'metadata'
+            assert 'warc-target-uri' not in headers
+            assert headers['x-crocoite-type'] == 'log'
+            assert headers['content-type'] == f'application/json; charset={handler.logEncoding}'
+
+            while True:
+                l = it.raw_stream.readline ()
+                if not l:
+                    break
+                data = json.loads (l.strip ())
+                assert data == golden.pop (0)
+
+def jsonObject ():
+    """ JSON-encodable objects """
+    return st.dictionaries (st.text (), st.one_of (st.integers (), st.text ()))
+
+def viewport ():
+    return st.builds (lambda x, y: f'{x}x{y}', st.integers (), st.integers ())
+
+def event ():
+    return st.one_of (
+            st.builds (ControllerStart, jsonObject ()),
+            st.builds (Script.fromStr, st.text (), st.one_of(st.none (), st.text ())),
+            st.builds (ScreenshotEvent, urls (), st.integers (), st.binary ()),
+            st.builds (DomSnapshotEvent, urls (), st.builds (lambda x: x.encode ('utf-8'), st.text ()), viewport()),
+            requestResponsePair (),
+            )
+
+@pytest.mark.asyncio
+@given (st.lists (event ()))
+async def test_push (golden):
+    def checkWarcinfoId (headers):
+        if lastWarcinfoRecordid is not None:
+            assert headers['WARC-Warcinfo-ID'] == lastWarcinfoRecordid
+
+    lastWarcinfoRecordid = None
+
+    # null logger
+    logger = Logger ()
+    with open('/tmp/test.warc.gz', 'w+b') as fd:
+        with WarcHandler (fd, logger) as handler:
+            for g in golden:
+                await handler.push (g)
+
+        fd.seek (0)
+        it = iter (ArchiveIterator (fd))
+        for g in golden:
+            if isinstance (g, ControllerStart):
+                rec = next (it)
+
+                headers = rec.rec_headers
+                assert headers['warc-type'] == 'warcinfo'
+                assert 'warc-target-uri' not in headers
+                assert 'x-crocoite-type' not in headers
+
+                data = json.load (rec.raw_stream)
+                assert data == g.payload
+
+                lastWarcinfoRecordid = headers['warc-record-id']
+                assert lastWarcinfoRecordid
+            elif isinstance (g, Script):
+                rec = next (it)
+
+                headers = rec.rec_headers
+                assert headers['warc-type'] == 'resource'
+                assert headers['content-type'] == 'application/javascript; charset=utf-8'
+                assert headers['x-crocoite-type'] == 'script'
+                checkWarcinfoId (headers)
+                if g.path:
+                    assert URL (headers['warc-target-uri']) == URL ('file://' + g.abspath)
+                else:
+                    assert 'warc-target-uri' not in headers
+
+                data = rec.raw_stream.read ().decode ('utf-8')
+                assert data == g.data
+            elif isinstance (g, ScreenshotEvent):
+                # XXX: check refers-to header
+                rec = next (it)
+
+                headers = rec.rec_headers
+                assert headers['warc-type'] == 'conversion'
+                assert headers['x-crocoite-type'] == 'screenshot'
+                checkWarcinfoId (headers)
+                assert URL (headers['warc-target-uri']) == g.url, (headers['warc-target-uri'], g.url)
+                assert headers['warc-refers-to'] is None
+                assert int (headers['X-Crocoite-Screenshot-Y-Offset']) == g.yoff
+
+                assert rec.raw_stream.read () == g.data
+            elif isinstance (g, DomSnapshotEvent):
+                rec = next (it)
+
+                headers = rec.rec_headers
+                assert headers['warc-type'] == 'conversion'
+                assert headers['x-crocoite-type'] == 'dom-snapshot'
+                checkWarcinfoId (headers)
+                assert URL (headers['warc-target-uri']) == g.url
+                assert headers['warc-refers-to'] is None
+
+                assert rec.raw_stream.read () == g.document
+            elif isinstance (g, RequestResponsePair):
+                rec = next (it)
+
+                # request
+                headers = rec.rec_headers
+                assert headers['warc-type'] == 'request'
+                assert 'x-crocoite-type' not in headers
+                checkWarcinfoId (headers)
+                assert URL (headers['warc-target-uri']) == g.url
+                assert headers['x-chrome-request-id'] == g.id
+                
+                assert CIMultiDict (rec.http_headers.headers) == g.request.headers
+                if g.request.hasPostData:
+                    if g.request.body is not None:
+                        assert rec.raw_stream.read () == g.request.body
+                    else:
+                        # body fetch failed
+                        assert headers['warc-truncated'] == 'unspecified'
+                        assert not rec.raw_stream.read ()
+                else:
+                    assert not rec.raw_stream.read ()
+
+                # response
+                if g.response:
+                    rec = next (it)
+                    headers = rec.rec_headers
+                    httpheaders = rec.http_headers
+                    assert headers['warc-type'] == 'response'
+                    checkWarcinfoId (headers)
+                    assert URL (headers['warc-target-uri']) == g.url
+                    assert headers['x-chrome-request-id'] == g.id
+                    assert 'x-crocoite-type' not in headers
+
+                    # these are checked separately
+                    filteredHeaders = CIMultiDict (httpheaders.headers)
+                    for b in {'content-type', 'content-length'}:
+                        if b in g.response.headers:
+                            g.response.headers.popall (b)
+                        if b in filteredHeaders:
+                            filteredHeaders.popall (b)
+                    assert filteredHeaders == g.response.headers
+
+                    expectedContentType = g.response.mimeType
+                    if expectedContentType is not None:
+                        assert httpheaders['content-type'].startswith (expectedContentType)
+
+                    if g.response.body is not None:
+                        assert rec.raw_stream.read () == g.response.body
+                        assert httpheaders['content-length'] == str (len (g.response.body))
+                        # body is never truncated if it exists
+                        assert headers['warc-truncated'] is None
+
+                        # unencoded strings are converted to utf8
+                        if isinstance (g.response.body, UnicodeBody) and httpheaders['content-type'] is not None:
+                            assert httpheaders['content-type'].endswith ('; charset=utf-8')
+                    else:
+                        # body fetch failed
+                        assert headers['warc-truncated'] == 'unspecified'
+                        assert not rec.raw_stream.read ()
+                        # content-length header should be kept intact
+            else:
+                assert False, f"invalid golden type {type(g)}" # pragma: no cover
+
+        # no further records
+        with pytest.raises (StopIteration):
+            next (it)
+
diff --git a/crocoite/tools.py b/crocoite/tools.py
index e2dc6a7..a2ddaa3 100644
--- a/crocoite/tools.py
+++ b/crocoite/tools.py
@@ -24,13 +24,23 @@ Misc tools
 
 import shutil, sys, os, logging, argparse, json
 from io import BytesIO
+
 from warcio.archiveiterator import ArchiveIterator
 from warcio.warcwriter import WARCWriter
-from .util import packageUrl, getSoftwareInfo
+from yarl import URL
+
+from pkg_resources import parse_version, parse_requirements
+
+from .util import getSoftwareInfo, StrJsonEncoder
+from .warc import jsonMime, makeContentType
 
 def mergeWarc (files, output):
+    # stats
     unique = 0
     revisit = 0
+    uniqueLength = 0
+    revisitLength = 0
+
     payloadMap = {}
     writer = WARCWriter (output, gzip=True)
 
@@ -48,9 +58,9 @@ def mergeWarc (files, output):
             'parameters': {'inputs': files},
             }
     payload = BytesIO (json.dumps (warcinfo, indent=2).encode ('utf-8'))
-    record = writer.create_warc_record (packageUrl ('warcinfo'), 'warcinfo',
+    record = writer.create_warc_record ('', 'warcinfo',
             payload=payload,
-            warc_headers_dict={'Content-Type': 'text/plain; encoding=utf-8'})
+            warc_headers_dict={'Content-Type': makeContentType (jsonMime, 'utf-8')})
     writer.write_record (record)
 
     for l in files:
@@ -60,13 +70,15 @@ def mergeWarc (files, output):
                     headers = record.rec_headers
                     rid = headers.get_header('WARC-Record-ID')
                     csum = headers.get_header('WARC-Payload-Digest')
+                    length = int (headers.get_header ('Content-Length'))
                     dup = payloadMap.get (csum, None)
                     if dup is None:
                         payloadMap[csum] = {'uri': headers.get_header('WARC-Target-URI'),
                                 'id': rid, 'date': headers.get_header('WARC-Date')}
                         unique += 1
+                        uniqueLength += length
                     else:
-                        logging.debug ('Record {} is duplicate of {}'.format (rid, dup['id']))
+                        logging.debug (f'Record {rid} is duplicate of {dup["id"]}')
                         # Payload may be identical, but HTTP headers are
                         # (probably) not. Include them.
                         record = writer.create_revisit_record (
@@ -76,10 +88,21 @@ def mergeWarc (files, output):
                         record.rec_headers.add_header ('WARC-Truncated', 'length')
                         record.rec_headers.add_header ('WARC-Refers-To', dup['id'])
                         revisit += 1
+                        revisitLength += length
                 else:
                     unique += 1
                 writer.write_record (record)
-    logging.info ('Wrote {} unique records, {} revisits'.format (unique, revisit))
+    json.dump (dict (
+            unique=dict (records=unique, bytes=uniqueLength),
+            revisit=dict (records=revisit, bytes=revisitLength),
+            ratio=dict (
+                    records=unique/(unique+revisit),
+                    bytes=uniqueLength/(uniqueLength+revisitLength)
+                    ),
+            ),
+            sys.stdout,
+            cls=StrJsonEncoder)
+    sys.stdout.write ('\n')
 
 def mergeWarcCli():
     parser = argparse.ArgumentParser(description='Merge WARCs, reads filenames from stdin.')
@@ -97,13 +120,19 @@ def extractScreenshot ():
     Extract page screenshots from a WARC generated by crocoite into files
     """
 
-    parser = argparse.ArgumentParser(description='Extract screenshots.')
-    parser.add_argument('-f', '--force', action='store_true', help='Overwrite existing files')
-    parser.add_argument('input', type=argparse.FileType ('rb'), help='Input WARC')
+    parser = argparse.ArgumentParser(description='Extract screenshots from '
+            'WARC, write JSON info to stdout.')
+    parser.add_argument('-f', '--force', action='store_true',
+            help='Overwrite existing files')
+    parser.add_argument('-1', '--one', action='store_true',
+            help='Only extract the first screenshot into a file named prefix')
+    parser.add_argument('input', type=argparse.FileType ('rb'),
+            help='Input WARC')
     parser.add_argument('prefix', help='Output file prefix')
 
     args = parser.parse_args()
 
+    i = 0
     with args.input:
         for record in ArchiveIterator (args.input):
             headers = record.rec_headers
@@ -112,13 +141,177 @@ def extractScreenshot ():
                     'X-Crocoite-Screenshot-Y-Offset' not in headers:
                 continue
 
-            urlSanitized = headers.get_header('WARC-Target-URI').replace ('/', '_')
-            xoff = 0
+            url = URL (headers.get_header ('WARC-Target-URI'))
             yoff = int (headers.get_header ('X-Crocoite-Screenshot-Y-Offset'))
-            outpath = '{}-{}-{}-{}.png'.format (args.prefix, urlSanitized, xoff, yoff)
+            outpath = f'{args.prefix}{i:05d}.png' if not args.one else args.prefix
             if args.force or not os.path.exists (outpath):
+                json.dump ({'file': outpath, 'url': url, 'yoff': yoff},
+                        sys.stdout, cls=StrJsonEncoder)
+                sys.stdout.write ('\n')
                 with open (outpath, 'wb') as out:
                     shutil.copyfileobj (record.raw_stream, out)
+                i += 1
             else:
-                print ('not overwriting {}'.format (outpath))
+                print (f'not overwriting {outpath}', file=sys.stderr)
+
+            if args.one:
+                break
+
+class Errata:
+    __slots__ = ('uuid', 'description', 'url', 'affects')
+
+    def __init__ (self, uuid, description, affects, url=None):
+        self.uuid = uuid
+        self.description = description
+        self.url = url
+        # slightly abusing setuptool’s version parsing/matching here
+        self.affects = list (parse_requirements(affects))
+
+    def __contains__ (self, pkg):
+        """
+        Return True if the versions in pkg are affected by this errata
+
+        pkg must be a mapping from project_name to version
+        """
+        matchedAll = []
+        for a in self.affects:
+            haveVersion = pkg.get (a.project_name, None)
+            matchedAll.append (haveVersion is not None and haveVersion in a)
+        return all (matchedAll)
+
+    def __repr__ (self):
+        return f'{self.__class__.__name__}({self.uuid!r}, {self.description!r}, {self.affects!r})'
+
+    @property
+    def fixable (self):
+        return getattr (self, 'applyFix', None) is not None
+
+    def toDict (self):
+        return {'uuid': self.uuid,
+                'description': self.description,
+                'url': self.url,
+                'affects': list (map (str, self.affects)),
+                'fixable': self.fixable}
+
+class FixableErrata(Errata):
+    __slots__ = ('stats')
+
+    def __init__ (self, uuid, description, affects, url=None):
+        super().__init__ (uuid, description, affects, url)
+        # statistics for fixable erratas
+        self.stats = dict (records=dict (fixed=0, processed=0))
+
+    def applyFix (self, record):
+        raise NotImplementedError () # pragma: no cover
+
+class ContentTypeErrata (FixableErrata):
+    def __init__ (self):
+        super().__init__ (
+            uuid='552c13dc-56e5-4539-9ad8-184ccae60930',
+            description='Content-Type header uses wrong argument name encoding instead of charset.',
+            url='https://github.com/PromyLOPh/crocoite/issues/19',
+            affects=['crocoite==1.0.0'])
+
+    def applyFix (self, record):
+        # XXX: this is ugly. warcio’s write_record replaces any Content-Type
+        # header we’re setting with this one. But printing rec_headers shows
+        # the header, not .content_type.
+        contentType = record.content_type
+        if '; encoding=' in contentType:
+            contentType = contentType.replace ('; encoding=', '; charset=')
+            record.content_type = contentType
+            self.stats['records']['fixed'] += 1
+
+        self.stats['records']['processed'] += 1
+        return record
+
+bugs = [
+    Errata (uuid='34a176b3-ad3d-430f-a082-68087f304572',
+            description='Generated by version < 1.0. No erratas are supported for this version.',
+            affects=['crocoite<1.0'],
+            ),
+    ContentTypeErrata (),
+    ]
+
+def makeReport (fd):
+    alreadyFixed = set ()
+
+    for record in ArchiveIterator (fd):
+        if record.rec_type == 'warcinfo':
+            try:
+                data = json.load (record.raw_stream)
+                # errata records precceed everything else and indicate which
+                # ones were fixed already
+                if data['tool'] == 'crocoite-errata':
+                    alreadyFixed.update (data['parameters']['errata'])
+                else:
+                    haveVersions = dict ([(pkg['projectName'], parse_version(pkg['version'])) for pkg in data['software']['self']])
+                    yield from filter (lambda b: haveVersions in b and b.uuid not in alreadyFixed, bugs)
+            except json.decoder.JSONDecodeError:
+                pass
+
+def errataCheck (args):
+    hasErrata = False
+    for item in makeReport (args.input):
+        json.dump (item.toDict (), sys.stdout)
+        sys.stdout.write ('\n')
+        sys.stdout.flush ()
+        hasErrata = True
+    return int (hasErrata)
+
+def errataFix (args):
+    errata = args.errata
+
+    with args.input as infd, args.output as outfd:
+        writer = WARCWriter (outfd, gzip=True)
+
+        warcinfo = {
+                'software': getSoftwareInfo (),
+                'tool': 'crocoite-errata', # not the name of the cli tool
+                'parameters': {'errata': [errata.uuid]},
+                }
+        payload = BytesIO (json.dumps (warcinfo, indent=2).encode ('utf-8'))
+        record = writer.create_warc_record ('', 'warcinfo',
+                payload=payload,
+                warc_headers_dict={'Content-Type': makeContentType (jsonMime, 'utf-8')})
+        writer.write_record (record)
+
+        for record in ArchiveIterator (infd):
+            fixedRecord = errata.applyFix (record)
+            writer.write_record (fixedRecord)
+    json.dump (errata.stats, sys.stdout)
+    sys.stdout.write ('\n')
+    sys.stdout.flush ()
+
+def uuidToErrata (uuid, onlyFixable=True):
+    try:
+        e = next (filter (lambda x: x.uuid == uuid, bugs))
+    except StopIteration:
+        raise argparse.ArgumentTypeError (f'Errata {uuid} does not exist')
+    if not isinstance (e, FixableErrata):
+        raise argparse.ArgumentTypeError (f'Errata {uuid} is not fixable')
+    return e
+
+def errata ():
+    parser = argparse.ArgumentParser(description=f'Show/fix erratas for WARCs generated by {__package__}.')
+    parser.add_argument('input', metavar='INPUT', type=argparse.FileType ('rb'), help='Input WARC')
+
+    # XXX: required argument does not work here?!
+    subparsers = parser.add_subparsers()
+
+    checkparser = subparsers.add_parser('check', help='Show erratas')
+    checkparser.set_defaults (func=errataCheck)
+
+    fixparser = subparsers.add_parser('fix', help='Fix erratas')
+    fixparser.add_argument('errata', metavar='UUID', type=uuidToErrata, help='Apply fix for this errata')
+    fixparser.add_argument('output', metavar='OUTPUT', type=argparse.FileType ('wb'), help='Output WARC')
+    fixparser.set_defaults (func=errataFix)
+
+    args = parser.parse_args()
+
+    if not hasattr (args, 'func'):
+        parser.print_usage ()
+        parser.exit ()
+
+    return args.func (args)
 
diff --git a/crocoite/util.py b/crocoite/util.py
index bd26909..da377a3 100644
--- a/crocoite/util.py
+++ b/crocoite/util.py
@@ -22,26 +22,30 @@
 Random utility functions
 """
 
-import random, sys, platform
+import random, sys, platform, os, json, urllib
+from datetime import datetime
 import hashlib, pkg_resources
-from urllib.parse import urlsplit, urlunsplit
 
-def packageUrl (path):
-    """
-    Create URL for package data stored into WARC
-    """
-    return 'urn:' + __package__ + ':' + path
+from yarl import URL
+
+class StrJsonEncoder (json.JSONEncoder):
+    """ JSON encoder that turns unknown classes into a string and thus never
+    fails """
+    def default (self, obj):
+        if isinstance (obj, datetime):
+            return obj.isoformat ()
+
+        # make sure serialization always succeeds
+        try:
+            return json.JSONEncoder.default(self, obj)
+        except TypeError:
+            return str (obj)
 
 async def getFormattedViewportMetrics (tab):
     layoutMetrics = await tab.Page.getLayoutMetrics ()
     # XXX: I’m not entirely sure which one we should use here
-    return '{}x{}'.format (layoutMetrics['layoutViewport']['clientWidth'],
-                layoutMetrics['layoutViewport']['clientHeight'])
-
-def removeFragment (u):
-    """ Remove fragment from url (i.e. #hashvalue) """
-    s = urlsplit (u)
-    return urlunsplit ((s.scheme, s.netloc, s.path, s.query, ''))
+    viewport = layoutMetrics['layoutViewport']
+    return f"{viewport['clientWidth']}x{viewport['clientHeight']}"
 
 def getSoftwareInfo ():
     """ Get software info for inclusion into warcinfo """
@@ -79,7 +83,7 @@ def getRequirements (dist):
             pkg = getattr (m, '__package__', None)
             # is loaded?
             if pkg in modules:
-                if f:
+                if f and os.path.isfile (f):
                     with open (f, 'rb') as fd:
                         contents = fd.read ()
                         h = hashlib.new ('sha512')
diff --git a/crocoite/warc.py b/crocoite/warc.py
index ebc460d..415b487 100644
--- a/crocoite/warc.py
+++ b/crocoite/warc.py
@@ -24,24 +24,36 @@ Classes writing data to WARC files
 
 import json, threading
 from io import BytesIO
-from urllib.parse import urlsplit
 from datetime import datetime
+from http.server import BaseHTTPRequestHandler
 
 from warcio.timeutils import datetime_to_iso_date
 from warcio.warcwriter import WARCWriter
 from warcio.statusandheaders import StatusAndHeaders
+from yarl import URL
 
-from .util import packageUrl
+from .util import StrJsonEncoder
 from .controller import EventHandler, ControllerStart
 from .behavior import Script, DomSnapshotEvent, ScreenshotEvent
-from .browser import Item
+from .browser import RequestResponsePair, UnicodeBody
+
+# the official mimetype for json, according to https://tools.ietf.org/html/rfc8259
+jsonMime = 'application/json'
+# mime for javascript, according to https://tools.ietf.org/html/rfc4329#section-7.2
+jsMime = 'application/javascript'
+
+def makeContentType (mime, charset=None):
+    """ Create value of Content-Type WARC header with optional charset """
+    s = [mime]
+    if charset:
+        s.extend (['; charset=', charset])
+    return ''.join (s)
 
 class WarcHandler (EventHandler):
     __slots__ = ('logger', 'writer', 'documentRecords', 'log',
             'maxLogSize', 'logEncoding', 'warcinfoRecordId')
 
-    def __init__ (self, fd,
-            logger):
+    def __init__ (self, fd, logger):
         self.logger = logger
         self.writer = WARCWriter (fd, gzip=True)
 
@@ -68,6 +80,7 @@ class WarcHandler (EventHandler):
 
         Adds default WARC headers.
         """
+        assert url is None or isinstance (url, URL)
 
         d = {}
         if self.warcinfoRecordId:
@@ -75,8 +88,11 @@ class WarcHandler (EventHandler):
         d.update (warc_headers_dict)
         warc_headers_dict = d
 
-        record = self.writer.create_warc_record (url, kind, payload=payload,
-                warc_headers_dict=warc_headers_dict, http_headers=http_headers)
+        record = self.writer.create_warc_record (str (url) if url else '',
+                kind,
+                payload=payload,
+                warc_headers_dict=warc_headers_dict,
+                http_headers=http_headers)
         self.writer.write_record (record)
 
         return record
@@ -85,72 +101,52 @@ class WarcHandler (EventHandler):
         logger = self.logger.bind (reqId=item.id)
 
         req = item.request
-        resp = item.response
-        url = urlsplit (resp['url'])
-
-        path = url.path
-        if url.query:
-            path += '?' + url.query
-        httpHeaders = StatusAndHeaders('{} {} HTTP/1.1'.format (req['method'], path),
-                item.requestHeaders, protocol='HTTP/1.1', is_http_request=True)
-        initiator = item.initiator
+        url = item.url
+
+        path = url.relative().with_fragment(None)
+        httpHeaders = StatusAndHeaders(f'{req.method} {path} HTTP/1.1',
+                req.headers, protocol='HTTP/1.1', is_http_request=True)
         warcHeaders = {
-                'X-Chrome-Initiator': json.dumps (initiator),
+                # required to correlate request with log entries
                 'X-Chrome-Request-ID': item.id,
-                'WARC-Date': datetime_to_iso_date (datetime.utcfromtimestamp (item.chromeRequest['wallTime'])),
+                'WARC-Date': datetime_to_iso_date (req.timestamp),
                 }
 
-        if item.requestBody is not None:
-            payload, payloadBase64Encoded = item.requestBody
-        else:
+        body = item.request.body
+        if item.request.hasPostData and body is None:
             # oops, don’t know what went wrong here
-            logger.error ('requestBody missing', uuid='ee9adc58-e723-4595-9feb-312a67ead6a0')
+            logger.error ('requestBody missing',
+                    uuid='ee9adc58-e723-4595-9feb-312a67ead6a0')
             warcHeaders['WARC-Truncated'] = 'unspecified'
-            payload = None
-
-        if payload:
-            payload = BytesIO (payload)
-            warcHeaders['X-Chrome-Base64Body'] = str (payloadBase64Encoded)
-        record = self.writeRecord (req['url'], 'request',
-                payload=payload, http_headers=httpHeaders,
+        else:
+            body = BytesIO (body)
+        record = self.writeRecord (url, 'request',
+                payload=body, http_headers=httpHeaders,
                 warc_headers_dict=warcHeaders)
         return record.rec_headers['WARC-Record-ID']
 
     def _writeResponse (self, item, concurrentTo):
         # fetch the body
         reqId = item.id
-        rawBody = None
-        base64Encoded = False
-        bodyTruncated = None
-        if item.isRedirect or item.body is None:
-            # redirects reuse the same request, thus we cannot safely retrieve
-            # the body (i.e getResponseBody may return the new location’s
-            # body). No body available means we failed to retrieve it.
-            bodyTruncated = 'unspecified'
-        else:
-            rawBody, base64Encoded = item.body
 
         # now the response
         resp = item.response
         warcHeaders = {
                 'WARC-Concurrent-To': concurrentTo,
-                'WARC-IP-Address': resp.get ('remoteIPAddress', ''),
-                'X-Chrome-Protocol': resp.get ('protocol', ''),
-                'X-Chrome-FromDiskCache': str (resp.get ('fromDiskCache')),
-                'X-Chrome-ConnectionReused': str (resp.get ('connectionReused')),
+                # required to correlate request with log entries
                 'X-Chrome-Request-ID': item.id,
-                'WARC-Date': datetime_to_iso_date (datetime.utcfromtimestamp (
-                        item.chromeRequest['wallTime']+
-                        (item.chromeResponse['timestamp']-item.chromeRequest['timestamp']))),
+                'WARC-Date': datetime_to_iso_date (resp.timestamp),
                 }
-        if bodyTruncated:
-            warcHeaders['WARC-Truncated'] = bodyTruncated
-        else:
-            warcHeaders['X-Chrome-Base64Body'] = str (base64Encoded)
+        # conditional WARC headers
+        if item.remoteIpAddress:
+            warcHeaders['WARC-IP-Address'] = item.remoteIpAddress
 
-        httpHeaders = StatusAndHeaders('{} {}'.format (resp['status'],
-                item.statusText), item.responseHeaders,
-                protocol='HTTP/1.1')
+        # HTTP headers
+        statusText = resp.statusText or \
+                BaseHTTPRequestHandler.responses.get (
+                resp.status, ('No status text available', ))[0]
+        httpHeaders = StatusAndHeaders(f'{resp.status} {statusText}',
+                resp.headers, protocol='HTTP/1.1')
 
         # Content is saved decompressed and decoded, remove these headers
         blacklistedHeaders = {'transfer-encoding', 'content-encoding'}
@@ -160,20 +156,21 @@ class WarcHandler (EventHandler):
         # chrome sends nothing but utf8 encoded text. Fortunately HTTP
         # headers take precedence over the document’s <meta>, thus we can
         # easily override those.
-        contentType = resp.get ('mimeType')
-        if contentType:
-            if not base64Encoded:
-                contentType += '; charset=utf-8'
-            httpHeaders.replace_header ('content-type', contentType)
-
-        if rawBody is not None:
-            httpHeaders.replace_header ('content-length', '{:d}'.format (len (rawBody)))
-            bodyIo = BytesIO (rawBody)
+        if resp.mimeType:
+            charset = 'utf-8' if isinstance (resp.body, UnicodeBody) else None
+            contentType = makeContentType (resp.mimeType, charset=charset)
+            httpHeaders.replace_header ('Content-Type', contentType)
+
+        # response body
+        body = resp.body
+        if body is None:
+            warcHeaders['WARC-Truncated'] = 'unspecified'
         else:
-            bodyIo = BytesIO ()
+            httpHeaders.replace_header ('Content-Length', str (len (body)))
+            body = BytesIO (body)
 
-        record = self.writeRecord (resp['url'], 'response',
-                warc_headers_dict=warcHeaders, payload=bodyIo,
+        record = self.writeRecord (item.url, 'response',
+                warc_headers_dict=warcHeaders, payload=body,
                 http_headers=httpHeaders)
 
         if item.resourceType == 'Document':
@@ -182,32 +179,38 @@ class WarcHandler (EventHandler):
     def _writeScript (self, item):
         writer = self.writer
         encoding = 'utf-8'
-        self.writeRecord (packageUrl ('script/{}'.format (item.path)), 'metadata',
+        # XXX: yes, we’re leaking information about the user here, but this is
+        # the one and only source URL of the scripts.
+        uri = URL(f'file://{item.abspath}') if item.path else None
+        self.writeRecord (uri, 'resource',
                 payload=BytesIO (str (item).encode (encoding)),
-                warc_headers_dict={'Content-Type': 'application/javascript; charset={}'.format (encoding)})
+                warc_headers_dict={
+                    'Content-Type': makeContentType (jsMime, encoding),
+                    'X-Crocoite-Type': 'script',
+                    })
 
     def _writeItem (self, item):
-        if item.failed:
-            # should have been handled by the logger already
-            return
-
+        assert item.request
         concurrentTo = self._writeRequest (item)
-        self._writeResponse (item, concurrentTo)
+        # items that failed loading don’t have a response
+        if item.response:
+            self._writeResponse (item, concurrentTo)
 
     def _addRefersTo (self, headers, url):
         refersTo = self.documentRecords.get (url)
         if refersTo:
             headers['WARC-Refers-To'] = refersTo
         else:
-            self.logger.error ('No document record found for {}'.format (url))
+            self.logger.error (f'No document record found for {url}')
         return headers
 
     def _writeDomSnapshot (self, item):
         writer = self.writer
 
-        warcHeaders = {'X-DOM-Snapshot': str (True),
+        warcHeaders = {
+                'X-Crocoite-Type': 'dom-snapshot',
                 'X-Chrome-Viewport': item.viewport,
-                'Content-Type': 'text/html; charset=utf-8',
+                'Content-Type': makeContentType ('text/html', 'utf-8')
                 }
 
         self._addRefersTo (warcHeaders, item.url)
@@ -218,53 +221,53 @@ class WarcHandler (EventHandler):
 
     def _writeScreenshot (self, item):
         writer = self.writer
-        warcHeaders = {'Content-Type': 'image/png',
-                'X-Crocoite-Screenshot-Y-Offset': str (item.yoff)}
+        warcHeaders = {
+                'Content-Type': makeContentType ('image/png'),
+                'X-Crocoite-Screenshot-Y-Offset': str (item.yoff),
+                'X-Crocoite-Type': 'screenshot',
+                }
         self._addRefersTo (warcHeaders, item.url)
         self.writeRecord (item.url, 'conversion',
                 payload=BytesIO (item.data), warc_headers_dict=warcHeaders)
 
-    def _writeControllerStart (self, item):
-        payload = BytesIO (json.dumps (item.payload, indent=2).encode ('utf-8'))
+    def _writeControllerStart (self, item, encoding='utf-8'):
+        payload = BytesIO (json.dumps (item.payload, indent=2, cls=StrJsonEncoder).encode (encoding))
 
         writer = self.writer
-        warcinfo = self.writeRecord (packageUrl ('warcinfo'), 'warcinfo',
-                warc_headers_dict={'Content-Type': 'text/plain; encoding=utf-8'},
+        warcinfo = self.writeRecord (None, 'warcinfo',
+                warc_headers_dict={'Content-Type': makeContentType (jsonMime, encoding)},
                 payload=payload)
         self.warcinfoRecordId = warcinfo.rec_headers['WARC-Record-ID']
 
     def _flushLogEntries (self):
-        writer = self.writer
-        self.log.seek (0)
-        # XXX: we should use the type continuation here
-        self.writeRecord (packageUrl ('log'), 'resource', payload=self.log,
-                warc_headers_dict={'Content-Type': 'text/plain; encoding={}'.format (self.logEncoding)})
-        self.log = BytesIO ()
+        if self.log.tell () > 0:
+            writer = self.writer
+            self.log.seek (0)
+            warcHeaders = {
+                    'Content-Type': makeContentType (jsonMime, self.logEncoding),
+                    'X-Crocoite-Type': 'log',
+                    }
+            self.writeRecord (None, 'metadata', payload=self.log,
+                    warc_headers_dict=warcHeaders)
+            self.log = BytesIO ()
 
     def _writeLog (self, item):
         """ Handle log entries, called by .logger.WarcHandlerConsumer only """
         self.log.write (item.encode (self.logEncoding))
         self.log.write (b'\n')
-        # instead of locking, check we’re running in the main thread
-        if self.log.tell () > self.maxLogSize and \
-                threading.current_thread () is threading.main_thread ():
+        if self.log.tell () > self.maxLogSize:
             self._flushLogEntries ()
 
     route = {Script: _writeScript,
-            Item: _writeItem,
+            RequestResponsePair: _writeItem,
             DomSnapshotEvent: _writeDomSnapshot,
             ScreenshotEvent: _writeScreenshot,
             ControllerStart: _writeControllerStart,
             }
 
-    def push (self, item):
-        processed = False
+    async def push (self, item):
         for k, v in self.route.items ():
             if isinstance (item, k):
                 v (self, item)
-                processed = True
                 break
 
-        if not processed:
-            self.logger.debug ('unknown event {}'.format (repr (item)))
-
diff --git a/doc/_ext/clicklist.py b/doc/_ext/clicklist.py
new file mode 100644
index 0000000..a69452c
--- /dev/null
+++ b/doc/_ext/clicklist.py
@@ -0,0 +1,45 @@
+"""
+Render click.yaml config file into human-readable list of supported sites
+"""
+
+import pkg_resources, yaml
+from docutils import nodes
+from docutils.parsers.rst import Directive
+from yarl import URL
+
+class ClickList (Directive):
+    def run(self):
+        # XXX: do this once only
+        fd = pkg_resources.resource_stream ('crocoite', 'data/click.yaml')
+        config = list (yaml.safe_load_all (fd))
+
+        l = nodes.definition_list ()
+        for site in config:
+            urls = set ()
+            v = nodes.definition ()
+            vl = nodes.bullet_list ()
+            v += vl
+            for s in site['selector']:
+                i = nodes.list_item ()
+                i += nodes.paragraph (text=s['description'])
+                vl += i
+                urls.update (map (lambda x: URL(x).with_path ('/'), s.get ('urls', [])))
+
+            item = nodes.definition_list_item ()
+            term = ', '.join (map (lambda x: x.host, urls)) if urls else site['match']
+            k = nodes.term (text=term)
+            item += k
+
+            item += v
+            l += item
+        return [l]
+
+def setup(app):
+    app.add_directive ("clicklist", ClickList)
+
+    return {
+        'version': '0.1',
+        'parallel_read_safe': True,
+        'parallel_write_safe': True,
+    }
+
diff --git a/doc/conf.py b/doc/conf.py
new file mode 100644
index 0000000..8336c27
--- /dev/null
+++ b/doc/conf.py
@@ -0,0 +1,44 @@
+# -*- coding: utf-8 -*-
+import os, sys
+
+# -- Project information -----------------------------------------------------
+
+project = 'crocoite'
+copyright = '2019 crocoite contributors'
+author = 'crocoite contributors'
+
+# -- General configuration ---------------------------------------------------
+
+sys.path.append(os.path.abspath("./_ext"))
+extensions = [
+    'sphinx.ext.viewcode',
+    'sphinx.ext.autodoc',
+    'clicklist',
+]
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ['_templates']
+
+source_suffix = '.rst'
+master_doc = 'index'
+language = 'en'
+exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
+pygments_style = 'tango'
+
+# -- Options for HTML output -------------------------------------------------
+
+html_theme = 'alabaster'
+html_theme_options = {
+    "description": "Preservation for the modern web",
+    "github_user": "PromyLOPh",
+    "github_repo": "crocoite",
+    "travis_button": True,
+    "github_button": True,
+    "codecov_button": True,
+    "fixed_sidebar": True,
+}
+#html_static_path = ['_static']
+html_sidebars = {
+   '**': ['about.html', 'navigation.html', 'searchbox.html'],
+}
+
diff --git a/doc/develop.rst b/doc/develop.rst
new file mode 100644
index 0000000..801ab21
--- /dev/null
+++ b/doc/develop.rst
@@ -0,0 +1,39 @@
+Development
+-----------
+
+Generally crocoite provides reasonable defaults for Google Chrome via
+:py:mod:`crocoite.devtools`. When debugging this software it might be necessary
+to open a non-headless instance of the browser by running
+
+.. code:: bash
+
+   google-chrome-stable --remote-debugging-port=9222 --auto-open-devtools-for-tabs
+
+and then passing the option :option:`--browser=http://localhost:9222` to
+:program:`crocoite-single`. This allows human intervention through the
+browser’s builtin console.
+
+Release guide
+^^^^^^^^^^^^^
+
+crocoite uses `semantic versioning`_. To create a new release, bump the version
+number in ``setup.py`` according to the linked guide, create distribution
+packages::
+
+    python setup.py sdist bdist_wheel
+
+Verify them::
+
+    twine check dist/*
+
+Try to install and use them in a separate sandbox. And finally sign and upload
+a new version to pypi_::
+
+    gpg --detach-sign --armor dist/*.tar.gz
+    twine upload dist/*
+
+Then update the documentation using :program:`sphing-doc` and upload it as well.
+
+.. _semantic versioning: https://semver.org/spec/v2.0.0.html
+.. _pypi: https://pypi.org
+
diff --git a/doc/index.rst b/doc/index.rst
new file mode 100644
index 0000000..53f5f77
--- /dev/null
+++ b/doc/index.rst
@@ -0,0 +1,36 @@
+crocoite
+========
+
+Preservation for the modern web, powered by `headless Google
+Chrome`_.
+
+.. _headless Google Chrome: https://developers.google.com/web/updates/2017/04/headless-chrome
+
+.. toctree::
+   :maxdepth: 1
+   :hidden:
+
+   usage.rst
+   plugins.rst
+   rationale.rst
+   develop.rst
+   related.rst
+
+Features
+--------
+
+Google Chrome-powered
+    HTML renderer, JavaScript engine and network stack, supporting modern web
+    technologies and protocols
+WARC output
+    Includes all network requests made by the browser
+Site interaction
+    :ref:`Auto-expand on-click content <click>`, infinite-scrolling
+DOM snapshot
+    Contains the page’s state, renderable without JavaScript
+Image screenshot
+    Entire page
+Machine-readable interface
+    Easy integration into custom tools/scripts
+
+
diff --git a/doc/plugins.rst b/doc/plugins.rst
new file mode 100644
index 0000000..062e1bf
--- /dev/null
+++ b/doc/plugins.rst
@@ -0,0 +1,16 @@
+Plugins
+=======
+
+crocoite comes with plug-ins that modify loaded sites’ or interact with them.
+
+.. _click:
+
+click
+-----
+
+The following sites are currently supported. Note this is an ongoing
+battle against layout changes and thus older software versions will stop
+working very soon.
+
+.. clicklist::
+
diff --git a/doc/rationale.rst b/doc/rationale.rst
new file mode 100644
index 0000000..f37db7c
--- /dev/null
+++ b/doc/rationale.rst
@@ -0,0 +1,76 @@
+Rationale
+---------
+
+Most modern websites depend heavily on executing code, usually JavaScript, on
+the user’s machine. They also make use of new and emerging Web technologies
+like HTML5, WebSockets, service workers and more. Even worse from the
+preservation point of view, they also require some form of user interaction to
+dynamically load more content (infinite scrolling, dynamic comment loading,
+etc).
+
+The naive approach of fetching a HTML page, parsing it and extracting
+links to referenced resources therefore is not sufficient to create a faithful
+snapshot of these web applications. A full browser, capable of running scripts and
+providing modern Web API’s is absolutely required for this task. Thankfully
+Google Chrome runs without a display (headless mode) and can be controlled by
+external programs, allowing them to navigate and extract or inject data.
+This section describes the solutions crocoite offers and explains design
+decisions taken.
+
+crocoite captures resources by listening to Chrome’s `network events`_ and
+requesting the response body using `Network.getResponseBody`_. This approach
+has caveats: The original HTTP requests and responses, as sent over the wire,
+are not available. They are reconstructed from parsed data. The character
+encoding for text documents is changed to UTF-8. And the content body of HTTP
+redirects cannot be retrieved due to a race condition.
+
+.. _network events: https://chromedevtools.github.io/devtools-protocol/1-3/Network
+.. _Network.getResponseBody: https://chromedevtools.github.io/devtools-protocol/1-3/Network#method-getResponseBody
+
+But at the same time it allows crocoite to rely on Chrome’s well-tested network
+stack and HTTP parser. Thus it supports HTTP version 1 and 2 as well as
+transport protocols like SSL and QUIC. Depending on Chrome also eliminates the
+need for a man-in-the-middle proxy, like warcprox_, which has to decrypt SSL
+traffic and present a fake certificate to the browser in order to store the
+transmitted content.
+
+.. _warcprox: https://github.com/internetarchive/warcprox
+
+WARC records generated by crocoite therefore are an abstract view on the
+resource they represent and not necessarily the data sent over the wire. A URL
+fetched with HTTP/2 for example will still result in a HTTP/1.1
+request/response pair in the WARC file. This may be undesireable from
+an archivist’s point of view (“save the data exactly like we received it”). But
+this level of abstraction is inevitable when dealing with more than one
+protocol.
+
+crocoite also interacts with and therefore alters the grabbed websites. It does
+so by injecting `behavior scripts`_ into the site. Typically these are written
+in JavaScript, because interacting with a page is easier this way. These
+scripts then perform different tasks: Extracting targets from visible
+hyperlinks, clicking buttons or scrolling the website to to load more content,
+as well as taking a static screenshot of ``<canvas>`` elements for the DOM
+snapshot (see below).
+
+.. _behavior scripts: https://github.com/PromyLOPh/crocoite/tree/master/crocoite/data
+
+Replaying archived WARC’s can be quite challenging and might not be possible
+with current technology (or even at all):
+
+- Some sites request assets based on screen resolution, pixel ratio and
+  supported image formats (webp). Replaying those with different parameters
+  won’t work, since assets for those are missing. Example: missguided.com.
+- Some fetch different scripts based on user agent. Example: youtube.com.
+- Requests containing randomly generated JavaScript callback function names
+  won’t work. Example: weather.com.
+- Range requests (Range: bytes=1-100) are captured as-is, making playback
+  difficult
+
+crocoite offers two methods to work around these issues. Firstly it can save a
+DOM snapshot to the WARC file. It contains the entire DOM in HTML format minus
+``<script>`` tags after the site has been fully loaded and thus can be
+displayed without executing scripts.  Obviously JavaScript-based navigation
+does not work any more. Secondly it also saves a screenshot of the full page,
+so even if future browsers cannot render and display the stored HTML a fully
+rendered version of the website can be replayed instead.
+
diff --git a/doc/related.rst b/doc/related.rst
new file mode 100644
index 0000000..62e2569
--- /dev/null
+++ b/doc/related.rst
@@ -0,0 +1,14 @@
+Related projects
+----------------
+
+brozzler_
+    Uses Google Chrome as well, but intercepts traffic using a proxy. Supports
+    distributed crawling and immediate playback.
+Squidwarc_
+    Communicates with headless Google Chrome and uses the Network API to
+    retrieve requests like crocoite. Supports recursive crawls and page
+    scrolling, but neither custom JavaScript nor distributed crawling.
+
+.. _brozzler: https://github.com/internetarchive/brozzler
+.. _Squidwarc: https://github.com/N0taN3rd/Squidwarc
+
diff --git a/doc/usage.rst b/doc/usage.rst
new file mode 100644
index 0000000..34a3e7b
--- /dev/null
+++ b/doc/usage.rst
@@ -0,0 +1,162 @@
+Usage
+-----
+
+Quick start using pywb_, expects Google Chrome to be installed already:
+
+.. code:: bash
+
+    pip install crocoite pywb
+    crocoite http://example.com/ example.com.warc.gz
+    wb-manager init test && wb-manager add test example.com.warc.gz
+    wayback &
+    $BROWSER http://localhost:8080
+
+.. _pywb: https://github.com/ikreymer/pywb
+
+It is recommended to install at least Micrsoft’s Corefonts_ as well as DejaVu_,
+Liberation_ or a similar font family covering a wide range of character sets.
+Otherwise page screenshots may be unusable due to missing glyphs.
+
+.. _Corefonts: http://corefonts.sourceforge.net/
+.. _DejaVu: https://dejavu-fonts.github.io/
+.. _Liberation: https://pagure.io/liberation-fonts
+
+Recursion
+^^^^^^^^^
+
+.. program:: crocoite
+
+By default crocoite will only retrieve the URL specified on the command line.
+However it can follow links as well. There’s currently two recursion strategies
+available, depth- and prefix-based.
+
+.. code:: bash
+
+   crocoite -r 1 https://example.com/ example.com.warc.gz
+
+will retrieve ``example.com`` and all pages directly refered to by it.
+Increasing the number increases the depth, so a value of :samp:`2` would first grab
+``example.com``, queue all pages linked there as well as every reference on
+each of those pages.
+
+On the other hand
+
+.. code:: bash
+
+   crocoite -r prefix https://example.com/dir/ example.com.warc.gz
+
+will retrieve the URL specified and all pages referenced which have the same
+URL prefix. There trailing slash is significant. Without it crocoite would also
+grab ``/dir-something`` or ``/dir.html`` for example.
+
+If an output file template is used each page is written to an individual file. For example
+
+.. code:: bash
+
+   crocoite -r prefix https://example.com/ '{host}-{date}-{seqnum}.warc.gz'
+
+will write one file page page to files like
+:file:`example.com-2019-09-09T15:15:15+02:00-1.warc.gz`. ``seqnum`` is unique to
+each page of a single job and should always be used.
+
+When running a recursive job, increasing the concurrency (i.e. how many pages
+are fetched at the same time) can speed up the process. For example you can
+pass :option:`-j` :samp:`4` to retrieve four pages at the same time. Keep in mind
+that each process starts a full browser that requires a lot of resources (one
+to two GB of RAM and one or two CPU cores).
+
+Customizing
+^^^^^^^^^^^
+
+.. program:: crocoite-single
+
+Under the hood :program:`crocoite` starts one instance of
+:program:`crocoite-single` to fetch each page. You can customize its options by
+appending a command template like this:
+
+.. code:: bash
+
+   crocoite -r prefix https://example.com example.com.warc.gz -- \
+        crocoite-single --timeout 5 -k '{url}' '{dest}'
+
+This reduces the global timeout to 5 seconds and ignores TLS errors. If an
+option is prefixed with an exclamation mark (``!``) it will not be expanded.
+This is useful for passing :option:`--warcinfo`, which expects JSON-encoded data.
+
+Command line options
+^^^^^^^^^^^^^^^^^^^^
+
+Below is a list of all command line arguments available:
+
+.. program:: crocoite
+
+crocoite
+++++++++
+
+Front-end with recursion support and simple job management.
+
+.. option:: -j N, --concurrency N
+
+   Maximum number of concurrent fetch jobs.
+
+.. option:: -r POLICY, --recursion POLICY
+
+   Enables recursion based on POLICY, which can be a positive integer
+   (recursion depth) or the string :kbd:`prefix`.
+
+.. option:: --tempdir DIR
+
+   Directory for temporary WARC files.
+
+.. program:: crocoite-single
+
+crocoite-single
++++++++++++++++
+
+Back-end to fetch a single page.
+
+.. option:: -b SET-COOKIE, --cookie SET-COOKIE
+
+   Add cookie to browser’s cookie jar. This option always *appends* cookies,
+   replacing those provided by :option:`-c`.
+
+   .. versionadded:: 1.1
+
+.. option:: -c FILE, --cookie-jar FILE
+
+   Load cookies from FILE. :program:`crocoite` provides a default cookie file,
+   which contains cookies to, for example, circumvent age restrictions. This
+   option *replaces* that default file.
+
+   .. versionadded:: 1.1
+
+.. option:: --idle-timeout SEC
+
+   Time after which a page is considered “idle”.
+
+.. option:: -k, --insecure
+
+   Allow insecure connections, i.e. self-signed ore expired HTTPS certificates.
+
+.. option:: --timeout SEC
+
+   Global archiving timeout.
+
+
+.. option:: --warcinfo JSON
+
+   Inject additional JSON-encoded information into the resulting WARC.
+
+IRC bot
+^^^^^^^
+
+A simple IRC bot (“chromebot”) is provided with the command :program:`crocoite-irc`.
+It reads its configuration from a config file like the example provided in
+:file:`contrib/chromebot.json` and supports the following commands:
+
+a <url> -j <concurrency> -r <policy> -k -b <set-cookie>
+    Archive <url> with <concurrency> processes according to recursion <policy>
+s <uuid>
+    Get job status for <uuid>
+r <uuid>
+    Revoke or abort running job with <uuid>
diff --git a/setup.cfg b/setup.cfg
index 32dfadf..ec7d730 100644
--- a/setup.cfg
+++ b/setup.cfg
@@ -4,3 +4,5 @@ test=pytest
 addopts=--cov-report=html --cov-report=xml --cov=crocoite --cov-config=setup.cfg
 [coverage:run]
 branch=True
+[build_sphinx]
+builder=dirhtml
diff --git a/setup.py b/setup.py
index 5ae7e65..628442e 100644
--- a/setup.py
+++ b/setup.py
@@ -2,13 +2,15 @@ from setuptools import setup
 
 setup(
     name='crocoite',
-    version='0.1.0',
+    version='1.1.1',
     author='Lars-Dominik Braun',
     author_email='lars+crocoite@6xq.net',
+    url='https://6xq.net/crocoite/',
     packages=['crocoite'],
     license='LICENSE.txt',
     description='Save website to WARC using Google Chrome.',
     long_description=open('README.rst').read(),
+    long_description_content_type='text/x-rst',
     install_requires=[
         'warcio',
         'html5lib>=0.999999999',
@@ -17,20 +19,39 @@ setup(
         'websockets',
         'aiohttp',
         'PyYAML',
+        'yarl>=1.4,<1.5',
+        'multidict',
     ],
+    extras_require={
+        'manhole': ['manhole>=1.6'],
+    },
     entry_points={
     'console_scripts': [
-            'crocoite-grab = crocoite.cli:single',
-            'crocoite-recursive = crocoite.cli:recursive',
+            # the main executable
+            'crocoite = crocoite.cli:recursive',
+            # backend helper
+            'crocoite-single = crocoite.cli:single',
+            # irc bot and dashboard
             'crocoite-irc = crocoite.cli:irc',
             'crocoite-irc-dashboard = crocoite.cli:dashboard',
+            # misc tools
             'crocoite-merge-warc = crocoite.tools:mergeWarcCli',
             'crocoite-extract-screenshot = crocoite.tools:extractScreenshot',
+            'crocoite-errata = crocoite.tools:errata',
             ],
     },
     package_data={
             'crocoite': ['data/*'],
     },
-    setup_requires=["pytest-runner"],
-    tests_require=["pytest", 'pytest-asyncio', 'pytest-cov'],
+    setup_requires=['pytest-runner'],
+    tests_require=["pytest", 'pytest-asyncio', 'pytest-cov', 'hypothesis'],
+    python_requires='>=3.6',
+    classifiers=[
+        'Development Status :: 5 - Production/Stable',
+        'License :: OSI Approved :: MIT License',
+        'Operating System :: POSIX',
+        'Programming Language :: Python :: 3.6',
+        'Programming Language :: Python :: 3.7',
+        'Topic :: Internet :: WWW/HTTP',
+    ],
 )