 README.rst        | 216
 crocoite/cli.py   |  44
 doc/conf.py       | 178
 doc/develop.rst   |  17
 doc/index.rst     |  41
 doc/install.rst   |  47
 doc/rationale.rst |  76
 doc/related.rst   |  14
 doc/usage.rst     |  15
 9 files changed, 433 insertions(+), 215 deletions(-)
diff --git a/README.rst b/README.rst
index 581ac13..176d9c2 100644..120000
--- a/README.rst
+++ b/README.rst
@@ -1,215 +1 @@
-crocoite
-========
-
-Preservation for the modern web, powered by `headless Google
-Chrome`_.
-
-.. image:: https://travis-ci.org/PromyLOPh/crocoite.svg?branch=master
- :target: https://travis-ci.org/PromyLOPh/crocoite
-
-.. image:: https://codecov.io/gh/PromyLOPh/crocoite/branch/master/graph/badge.svg
- :target: https://codecov.io/gh/PromyLOPh/crocoite
-
-.. _headless Google Chrome: https://developers.google.com/web/updates/2017/04/headless-chrome
-
-Quick start
------------
-
-These dependencies must be present to run crocoite:
-
-- Python ≥3.6
-- PyYAML_
-- aiohttp_
-- websockets_
-- warcio_
-- html5lib_
-- yarl_
-- multidict_
-- bottom_ (IRC client)
-- `Google Chrome`_
-
-.. _PyYAML: https://pyyaml.org/wiki/PyYAML
-.. _aiohttp: https://aiohttp.readthedocs.io/
-.. _websockets: https://websockets.readthedocs.io/
-.. _warcio: https://github.com/webrecorder/warcio
-.. _html5lib: https://github.com/html5lib/html5lib-python
-.. _bottom: https://github.com/numberoverzero/bottom
-.. _Google Chrome: https://www.google.com/chrome/
-.. _yarl: https://yarl.readthedocs.io/
-.. _multidict: https://multidict.readthedocs.io/
-
-The following commands clone the repository from GitHub_, set up a virtual
-environment and install crocoite:
-
-.. _GitHub: https://github.com/PromyLOPh/crocoite
-
-.. code:: bash
-
- git clone https://github.com/PromyLOPh/crocoite.git
- cd crocoite
- virtualenv -p python3 sandbox
- source sandbox/bin/activate
- pip install .
-
-One-shot command line interface and pywb_ playback:
-
-.. code:: bash
-
- pip install pywb
- crocoite-grab http://example.com/ example.com.warc.gz
- rm -rf collections && wb-manager init test && wb-manager add test example.com.warc.gz
- wayback &
- $BROWSER http://localhost:8080
-
-.. _pywb: https://github.com/ikreymer/pywb
-
-Rationale
----------
-
-Most modern websites depend heavily on executing code, usually JavaScript, on
-the user’s machine. They also make use of new and emerging Web technologies
-like HTML5, WebSockets, service workers and more. Even worse from the
-preservation point of view, they also require some form of user interaction to
-dynamically load more content (infinite scrolling, dynamic comment loading,
-etc).
-
-The naive approach of fetching an HTML page, parsing it and extracting
-links to referenced resources therefore is not sufficient to create a faithful
-snapshot of these web applications. A full browser, capable of running scripts and
-providing modern Web APIs, is absolutely required for this task. Thankfully
-Google Chrome runs without a display (headless mode) and can be controlled by
-external programs, allowing them to navigate and extract or inject data.
-This section describes the solutions crocoite offers and explains design
-decisions taken.
-
-crocoite captures resources by listening to Chrome’s `network events`_ and
-requesting the response body using `Network.getResponseBody`_. This approach
-has caveats: The original HTTP requests and responses, as sent over the wire,
-are not available. They are reconstructed from parsed data. The character
-encoding for text documents is changed to UTF-8. And the content body of HTTP
-redirects cannot be retrieved due to a race condition.
-
-.. _network events: https://chromedevtools.github.io/devtools-protocol/1-3/Network
-.. _Network.getResponseBody: https://chromedevtools.github.io/devtools-protocol/1-3/Network#method-getResponseBody
-
-But at the same time it allows crocoite to rely on Chrome’s well-tested network
-stack and HTTP parser. Thus it supports HTTP version 1 and 2 as well as
-transport protocols like TLS and QUIC. Depending on Chrome also eliminates the
-need for a man-in-the-middle proxy, like warcprox_, which has to decrypt TLS
-traffic and present a fake certificate to the browser in order to store the
-transmitted content.
-
-.. _warcprox: https://github.com/internetarchive/warcprox
-
-WARC records generated by crocoite therefore are an abstract view on the
-resource they represent and not necessarily the data sent over the wire. A URL
-fetched with HTTP/2 for example will still result in a HTTP/1.1
-request/response pair in the WARC file. This may be undesirable from
-an archivist’s point of view (“save the data exactly like we received it”). But
-this level of abstraction is inevitable when dealing with more than one
-protocol.
-
-crocoite also interacts with and therefore alters the grabbed websites. It does
-so by injecting `behavior scripts`_ into the site. Typically these are written
-in JavaScript, because interacting with a page is easier this way. These
-scripts then perform different tasks: Extracting targets from visible
-hyperlinks, clicking buttons or scrolling the website to load more content,
-as well as taking a static screenshot of ``<canvas>`` elements for the DOM
-snapshot (see below).
-
-.. _behavior scripts: https://github.com/PromyLOPh/crocoite/tree/master/crocoite/data
-
-Replaying archived WARCs can be quite challenging and might not be possible
-with current technology (or even at all):
-
-- Some sites request assets based on screen resolution, pixel ratio and
- supported image formats (webp). Replaying those with different parameters
- won’t work, since assets for those are missing. Example: missguided.com.
-- Some fetch different scripts based on user agent. Example: youtube.com.
-- Requests containing randomly generated JavaScript callback function names
- won’t work. Example: weather.com.
-- Range requests (``Range: bytes=1-100``) are captured as-is, making playback
-  difficult.
-
-crocoite offers two methods to work around these issues. Firstly it can save a
-DOM snapshot to the WARC file. It contains the entire DOM in HTML format minus
-``<script>`` tags after the site has been fully loaded and thus can be
-displayed without executing scripts. Obviously JavaScript-based navigation
-does not work any more. Secondly it also saves a screenshot of the full page,
-so even if future browsers cannot render and display the stored HTML a fully
-rendered version of the website can be replayed instead.
-
-Advanced usage
---------------
-
-crocoite is built with the Unix philosophy (“do one thing and do it well”) in
-mind. Thus ``crocoite-grab`` can only save a single page. If you want recursion
-use ``crocoite-recursive``, which follows hyperlinks according to ``--policy``.
-It can either recurse a maximum number of levels or grab all pages with the
-same prefix as the start URL:
-
-.. code:: bash
-
- crocoite-recursive --policy prefix http://www.example.com/dir/ output
-
-will save all pages in ``/dir/`` and below to individual files in the output
-directory ``output``. You can customize the command used to grab individual
-pages by appending it after ``output``. This way distributed grabs (ssh to a
-different machine and execute the job there, queue the command with Slurm, …)
-are possible.
-
-IRC bot
-^^^^^^^
-
-A simple IRC bot (“chromebot”) is provided with the command ``crocoite-irc``.
-It reads its configuration from a config file like the example provided in
-``contrib/chromebot.json`` and supports the following commands:
-
-a <url> -j <concurrency> -r <policy>
- Archive <url> with <concurrency> processes according to recursion <policy>
-s <uuid>
- Get job status for <uuid>
-r <uuid>
- Revoke or abort running job with <uuid>
-
-Browser configuration
-^^^^^^^^^^^^^^^^^^^^^
-
-Generally crocoite provides reasonable defaults for Google Chrome via its
-`devtools module`_. When debugging this software it might be necessary to open
-a non-headless instance of the browser by running
-
-.. code:: bash
-
- google-chrome-stable --remote-debugging-port=9222 --auto-open-devtools-for-tabs
-
-and then passing the option ``--browser=http://localhost:9222`` to
-``crocoite-grab``. This allows human intervention through the browser’s built-in
-console.
-
-Another issue that might arise is related to fonts. Headless servers usually
-don’t have them installed by default and thus rendered screenshots may contain
-replacement characters (□) instead of the actual text. This affects mostly
-non-latin character sets. It is therefore recommended to install at least
-Microsoft’s Corefonts_ as well as DejaVu_, Liberation_ or a similar font family
-covering a wide range of character sets.
-
-.. _devtools module: crocoite/devtools.py
-.. _Corefonts: http://corefonts.sourceforge.net/
-.. _DejaVu: https://dejavu-fonts.github.io/
-.. _Liberation: https://pagure.io/liberation-fonts
-
-Related projects
-----------------
-
-brozzler_
- Uses Google Chrome as well, but intercepts traffic using a proxy. Supports
- distributed crawling and immediate playback.
-Squidwarc_
- Communicates with headless Google Chrome and uses the Network API to
- retrieve requests like crocoite. Supports recursive crawls and page
- scrolling, but neither custom JavaScript nor distributed crawling.
-
-.. _brozzler: https://github.com/internetarchive/brozzler
-.. _Squidwarc: https://github.com/N0taN3rd/Squidwarc
-
+doc/index.rst
\ No newline at end of file
diff --git a/crocoite/cli.py b/crocoite/cli.py
index fb9060d..d9ebc4d 100644
--- a/crocoite/cli.py
+++ b/crocoite/cli.py
@@ -50,6 +50,19 @@ class SingleExitStatus(IntEnum):
Navigate = 3
def single ():
+ """
+ One-shot command line interface and pywb_ playback:
+
+ .. code:: bash
+
+ pip install pywb
+ crocoite-grab http://example.com/ example.com.warc.gz
+ rm -rf collections && wb-manager init test && wb-manager add test example.com.warc.gz
+ wayback &
+ $BROWSER http://localhost:8080
+
+ .. _pywb: https://github.com/ikreymer/pywb
+ """
parser = argparse.ArgumentParser(description='Save website to WARC using Google Chrome.')
parser.add_argument('--browser', help='DevTools URL', metavar='URL')
parser.add_argument('--timeout', default=1*60*60, type=int, help='Maximum time for archival', metavar='SEC')
@@ -114,6 +127,24 @@ def parsePolicy (recursive, url):
raise ValueError ('Unsupported')
def recursive ():
+ """
+ crocoite is built with the Unix philosophy (“do one thing and do it well”) in
+ mind. Thus ``crocoite-grab`` can only save a single page. If you want recursion
+ use ``crocoite-recursive``, which follows hyperlinks according to ``--policy``.
+ It can either recurse a maximum number of levels or grab all pages with the
+ same prefix as the start URL:
+
+ .. code:: bash
+
+ crocoite-recursive --policy prefix http://www.example.com/dir/ output
+
+ will save all pages in ``/dir/`` and below to individual files in the output
+ directory ``output``. You can customize the command used to grab individual
+ pages by appending it after ``output``. This way distributed grabs (ssh to a
+ different machine and execute the job there, queue the command with Slurm, …)
+ are possible.
+ """
+
logger = Logger (consumer=[DatetimeConsumer (), JsonPrintConsumer ()])
parser = argparse.ArgumentParser(description='Recursively run crocoite-grab.')
@@ -149,6 +180,19 @@ def recursive ():
return 0
def irc ():
+ """
+ A simple IRC bot (“chromebot”) is provided with the command ``crocoite-irc``.
+ It reads its configuration from a config file like the example provided in
+ ``contrib/chromebot.json`` and supports the following commands:
+
+ a <url> -j <concurrency> -r <policy>
+ Archive <url> with <concurrency> processes according to recursion <policy>
+ s <uuid>
+ Get job status for <uuid>
+ r <uuid>
+ Revoke or abort running job with <uuid>
+ """
+
import json, re
from .irc import Chromebot
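The ``--policy prefix`` behaviour documented in the ``recursive`` docstring above can be sketched as a simple URL comparison. This is a hypothetical illustration only; crocoite's actual decision logic (``parsePolicy`` in ``crocoite/cli.py``) may differ in detail, e.g. in URL normalization:

```python
from urllib.parse import urlsplit

def follows_prefix(start_url: str, candidate: str) -> bool:
    """Return True if candidate lies under the start URL's path prefix.

    Hypothetical sketch of the ``--policy prefix`` rule; crocoite's real
    check may handle normalization and edge cases differently.
    """
    start, cand = urlsplit(start_url), urlsplit(candidate)
    # Same scheme and host, and the candidate path extends the start path.
    return ((start.scheme, start.netloc) == (cand.scheme, cand.netloc)
            and cand.path.startswith(start.path))

# Mirrors the example invocation from the docstring above:
follows_prefix('http://www.example.com/dir/', 'http://www.example.com/dir/a.html')  # True
follows_prefix('http://www.example.com/dir/', 'http://www.example.com/other/')      # False
```

With ``--policy prefix`` the crawl stays inside ``/dir/``; a depth-based policy would instead count recursion levels from the start URL.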
diff --git a/doc/conf.py b/doc/conf.py
new file mode 100644
index 0000000..746dbc6
--- /dev/null
+++ b/doc/conf.py
@@ -0,0 +1,178 @@
+# -*- coding: utf-8 -*-
+#
+# Configuration file for the Sphinx documentation builder.
+#
+# This file does only contain a selection of the most common options. For a
+# full list see the documentation:
+# http://www.sphinx-doc.org/en/master/config
+
+# -- Path setup --------------------------------------------------------------
+
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use os.path.abspath to make it absolute, like shown here.
+#
+# import os
+# import sys
+# sys.path.insert(0, os.path.abspath('.'))
+
+
+# -- Project information -----------------------------------------------------
+
+project = 'crocoite'
+copyright = ''
+author = ''
+
+# The short X.Y version
+version = ''
+# The full version, including alpha/beta/rc tags
+release = '0.1'
+
+
+# -- General configuration ---------------------------------------------------
+
+# If your documentation needs a minimal Sphinx version, state it here.
+#
+# needs_sphinx = '1.0'
+
+# Add any Sphinx extension module names here, as strings. They can be
+# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
+# ones.
+extensions = [
+ 'sphinx.ext.viewcode',
+ 'sphinx.ext.autodoc',
+]
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ['_templates']
+
+# The suffix(es) of source filenames.
+# You can specify multiple suffix as a list of string:
+#
+# source_suffix = ['.rst', '.md']
+source_suffix = '.rst'
+
+# The master toctree document.
+master_doc = 'index'
+
+# The language for content autogenerated by Sphinx. Refer to documentation
+# for a list of supported languages.
+#
+# This is also used if you do content translation via gettext catalogs.
+# Usually you set "language" from the command line for these cases.
+language = None
+
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+# This pattern also affects html_static_path and html_extra_path.
+exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
+
+# The name of the Pygments (syntax highlighting) style to use.
+pygments_style = None
+
+
+# -- Options for HTML output -------------------------------------------------
+
+# The theme to use for HTML and HTML Help pages. See the documentation for
+# a list of builtin themes.
+#
+html_theme = 'alabaster'
+
+# Theme options are theme-specific and customize the look and feel of a theme
+# further. For a list of options available for each theme, see the
+# documentation.
+#
+# html_theme_options = {}
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ['_static']
+
+# Custom sidebar templates, must be a dictionary that maps document names
+# to template names.
+#
+# The default sidebars (for documents that don't match any pattern) are
+# defined by theme itself. Builtin themes are using these templates by
+# default: ``['localtoc.html', 'relations.html', 'sourcelink.html',
+# 'searchbox.html']``.
+#
+# html_sidebars = {}
+
+
+# -- Options for HTMLHelp output ---------------------------------------------
+
+# Output file base name for HTML help builder.
+htmlhelp_basename = 'crocoitedoc'
+
+
+# -- Options for LaTeX output ------------------------------------------------
+
+latex_elements = {
+ # The paper size ('letterpaper' or 'a4paper').
+ #
+ # 'papersize': 'letterpaper',
+
+ # The font size ('10pt', '11pt' or '12pt').
+ #
+ # 'pointsize': '10pt',
+
+ # Additional stuff for the LaTeX preamble.
+ #
+ # 'preamble': '',
+
+ # Latex figure (float) alignment
+ #
+ # 'figure_align': 'htbp',
+}
+
+# Grouping the document tree into LaTeX files. List of tuples
+# (source start file, target name, title,
+# author, documentclass [howto, manual, or own class]).
+latex_documents = [
+ (master_doc, 'crocoite.tex', 'crocoite Documentation',
+ 'crocoite contributors', 'manual'),
+]
+
+
+# -- Options for manual page output ------------------------------------------
+
+# One entry per manual page. List of tuples
+# (source start file, name, description, authors, manual section).
+man_pages = [
+ (master_doc, 'crocoite', 'crocoite Documentation',
+ [author], 1)
+]
+
+
+# -- Options for Texinfo output ----------------------------------------------
+
+# Grouping the document tree into Texinfo files. List of tuples
+# (source start file, target name, title, author,
+# dir menu entry, description, category)
+texinfo_documents = [
+ (master_doc, 'crocoite', 'crocoite Documentation',
+ author, 'crocoite', 'One line description of project.',
+ 'Miscellaneous'),
+]
+
+
+# -- Options for Epub output -------------------------------------------------
+
+# Bibliographic Dublin Core info.
+epub_title = project
+
+# The unique identifier of the text. This can be a ISBN number
+# or the project homepage.
+#
+# epub_identifier = ''
+
+# A unique identification for the text.
+#
+# epub_uid = ''
+
+# A list of files that should not be packed into the epub file.
+epub_exclude_files = ['search.html']
+
+
+# -- Extension configuration -------------------------------------------------
diff --git a/doc/develop.rst b/doc/develop.rst
new file mode 100644
index 0000000..0113c92
--- /dev/null
+++ b/doc/develop.rst
@@ -0,0 +1,17 @@
+Development
+-----------
+
+Generally crocoite provides reasonable defaults for Google Chrome via its
+`devtools module`_. When debugging this software it might be necessary to open
+a non-headless instance of the browser by running
+
+.. code:: bash
+
+ google-chrome-stable --remote-debugging-port=9222 --auto-open-devtools-for-tabs
+
+and then passing the option ``--browser=http://localhost:9222`` to
+``crocoite-grab``. This allows human intervention through the browser’s built-in
+console.
+
+.. _devtools module: crocoite/devtools.py
+
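Talking to that debugging port ultimately means exchanging JSON messages over a WebSocket. The sketch below shows only the message framing, using the standard library; the method and field names (``Network.getResponseBody``, ``requestId``) come from the public DevTools protocol, while the request id value and everything else are illustrative:

```python
import json
from itertools import count

_msg_ids = count(1)

def cdp_command(method: str, **params) -> str:
    """Frame one DevTools protocol command as the JSON text Chrome expects.

    A sketch only: the actual WebSocket transport, response matching and
    event handling are out of scope here.
    """
    return json.dumps({'id': next(_msg_ids), 'method': method, 'params': params})

# Typical capture sequence: enable network events, then, once a
# Network.responseReceived event delivers a requestId, fetch its body.
enable = cdp_command('Network.enable')
fetch_body = cdp_command('Network.getResponseBody', requestId='1000.1')
# Chrome answers with {"id": ..., "result": {"body": ..., "base64Encoded": ...}};
# binary resources come back base64-encoded.
```

Replies carry the same ``id`` as the command, which is how concurrent requests are matched to their responses.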
diff --git a/doc/index.rst b/doc/index.rst
new file mode 100644
index 0000000..d62c7e1
--- /dev/null
+++ b/doc/index.rst
@@ -0,0 +1,41 @@
+crocoite
+========
+
+Preservation for the modern web, powered by `headless Google
+Chrome`_.
+
+.. image:: https://travis-ci.org/PromyLOPh/crocoite.svg?branch=master
+ :target: https://travis-ci.org/PromyLOPh/crocoite
+
+.. image:: https://codecov.io/gh/PromyLOPh/crocoite/branch/master/graph/badge.svg
+ :target: https://codecov.io/gh/PromyLOPh/crocoite
+
+.. _headless Google Chrome: https://developers.google.com/web/updates/2017/04/headless-chrome
+
+.. toctree::
+ :maxdepth: 1
+ :hidden:
+
+ install.rst
+ usage.rst
+ rationale.rst
+ develop.rst
+ related.rst
+
+Features
+--------
+
+Google Chrome-powered
+ HTML renderer, JavaScript engine and network stack, supporting modern web
+ technologies and protocols
+WARC output
+ Includes all network requests made by the browser
+Site interaction
+ Auto-expand on-click content, infinite-scrolling
+DOM snapshot
+ Contains the page’s state, renderable without JavaScript
+Image screenshot
+ Entire page
+Machine-readable interface
+ Easy integration into custom tools/scripts
+
diff --git a/doc/install.rst b/doc/install.rst
new file mode 100644
index 0000000..5e76956
--- /dev/null
+++ b/doc/install.rst
@@ -0,0 +1,47 @@
+Installation
+------------
+
+These dependencies must be present to run crocoite:
+
+- Python ≥3.6
+- PyYAML_
+- aiohttp_
+- websockets_
+- warcio_
+- html5lib_
+- yarl_
+- multidict_
+- bottom_ (IRC client)
+- `Google Chrome`_
+
+.. _PyYAML: https://pyyaml.org/wiki/PyYAML
+.. _aiohttp: https://aiohttp.readthedocs.io/
+.. _websockets: https://websockets.readthedocs.io/
+.. _warcio: https://github.com/webrecorder/warcio
+.. _html5lib: https://github.com/html5lib/html5lib-python
+.. _bottom: https://github.com/numberoverzero/bottom
+.. _Google Chrome: https://www.google.com/chrome/
+.. _yarl: https://yarl.readthedocs.io/
+.. _multidict: https://multidict.readthedocs.io/
+
+The following commands clone the repository from GitHub_, set up a virtual
+environment and install crocoite:
+
+.. _GitHub: https://github.com/PromyLOPh/crocoite
+
+.. code:: bash
+
+ git clone https://github.com/PromyLOPh/crocoite.git
+ cd crocoite
+ virtualenv -p python3 sandbox
+ source sandbox/bin/activate
+ pip install .
+
+It is recommended to install at least Microsoft’s Corefonts_ as well as DejaVu_,
+Liberation_ or a similar font family covering a wide range of character sets.
+Otherwise page screenshots may be unusable due to missing glyphs.
+
+.. _Corefonts: http://corefonts.sourceforge.net/
+.. _DejaVu: https://dejavu-fonts.github.io/
+.. _Liberation: https://pagure.io/liberation-fonts
+
diff --git a/doc/rationale.rst b/doc/rationale.rst
new file mode 100644
index 0000000..f37db7c
--- /dev/null
+++ b/doc/rationale.rst
@@ -0,0 +1,76 @@
+Rationale
+---------
+
+Most modern websites depend heavily on executing code, usually JavaScript, on
+the user’s machine. They also make use of new and emerging Web technologies
+like HTML5, WebSockets, service workers and more. Even worse from the
+preservation point of view, they also require some form of user interaction to
+dynamically load more content (infinite scrolling, dynamic comment loading,
+etc).
+
+The naive approach of fetching an HTML page, parsing it and extracting
+links to referenced resources therefore is not sufficient to create a faithful
+snapshot of these web applications. A full browser, capable of running scripts and
+providing modern Web APIs, is absolutely required for this task. Thankfully
+Google Chrome runs without a display (headless mode) and can be controlled by
+external programs, allowing them to navigate and extract or inject data.
+This section describes the solutions crocoite offers and explains design
+decisions taken.
+
+crocoite captures resources by listening to Chrome’s `network events`_ and
+requesting the response body using `Network.getResponseBody`_. This approach
+has caveats: The original HTTP requests and responses, as sent over the wire,
+are not available. They are reconstructed from parsed data. The character
+encoding for text documents is changed to UTF-8. And the content body of HTTP
+redirects cannot be retrieved due to a race condition.
+
+.. _network events: https://chromedevtools.github.io/devtools-protocol/1-3/Network
+.. _Network.getResponseBody: https://chromedevtools.github.io/devtools-protocol/1-3/Network#method-getResponseBody
+
+But at the same time it allows crocoite to rely on Chrome’s well-tested network
+stack and HTTP parser. Thus it supports HTTP version 1 and 2 as well as
+transport protocols like TLS and QUIC. Depending on Chrome also eliminates the
+need for a man-in-the-middle proxy, like warcprox_, which has to decrypt TLS
+traffic and present a fake certificate to the browser in order to store the
+transmitted content.
+
+.. _warcprox: https://github.com/internetarchive/warcprox
+
+WARC records generated by crocoite therefore are an abstract view on the
+resource they represent and not necessarily the data sent over the wire. A URL
+fetched with HTTP/2 for example will still result in a HTTP/1.1
+request/response pair in the WARC file. This may be undesirable from
+an archivist’s point of view (“save the data exactly like we received it”). But
+this level of abstraction is inevitable when dealing with more than one
+protocol.
+
+crocoite also interacts with and therefore alters the grabbed websites. It does
+so by injecting `behavior scripts`_ into the site. Typically these are written
+in JavaScript, because interacting with a page is easier this way. These
+scripts then perform different tasks: Extracting targets from visible
+hyperlinks, clicking buttons or scrolling the website to load more content,
+as well as taking a static screenshot of ``<canvas>`` elements for the DOM
+snapshot (see below).
+
+.. _behavior scripts: https://github.com/PromyLOPh/crocoite/tree/master/crocoite/data
+
+Replaying archived WARCs can be quite challenging and might not be possible
+with current technology (or even at all):
+
+- Some sites request assets based on screen resolution, pixel ratio and
+ supported image formats (webp). Replaying those with different parameters
+ won’t work, since assets for those are missing. Example: missguided.com.
+- Some fetch different scripts based on user agent. Example: youtube.com.
+- Requests containing randomly generated JavaScript callback function names
+ won’t work. Example: weather.com.
+- Range requests (``Range: bytes=1-100``) are captured as-is, making playback
+  difficult.
+
+crocoite offers two methods to work around these issues. Firstly it can save a
+DOM snapshot to the WARC file. It contains the entire DOM in HTML format minus
+``<script>`` tags after the site has been fully loaded and thus can be
+displayed without executing scripts. Obviously JavaScript-based navigation
+does not work any more. Secondly it also saves a screenshot of the full page,
+so even if future browsers cannot render and display the stored HTML a fully
+rendered version of the website can be replayed instead.
+
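The script-free DOM snapshot described above amounts to serializing the page while dropping ``<script>`` elements. As a toy illustration of that filtering step only (crocoite serializes the live DOM inside the browser; this standard-library sketch merely shows the idea on static HTML):

```python
from html.parser import HTMLParser

class ScriptStripper(HTMLParser):
    """Toy filter: re-emit HTML with <script> elements removed."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.in_script = 0  # nesting counter, so stray end tags are safe

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.in_script += 1
            return
        if not self.in_script:
            rendered = ''.join(f' {k}="{v}"' for k, v in attrs if v is not None)
            self.out.append(f'<{tag}{rendered}>')

    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_script = max(0, self.in_script - 1)
            return
        if not self.in_script:
            self.out.append(f'</{tag}>')

    def handle_data(self, data):
        # Script bodies arrive here too; skip them while inside <script>.
        if not self.in_script:
            self.out.append(data)

def strip_scripts(html: str) -> str:
    parser = ScriptStripper()
    parser.feed(html)
    return ''.join(parser.out)

strip_scripts('<p>hi</p><script>alert(1)</script><p>bye</p>')
# → '<p>hi</p><p>bye</p>'
```

The real snapshot works on the browser's parsed DOM, so it also captures state created by scripts before they are stripped, which is exactly what static HTML filtering cannot do.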
diff --git a/doc/related.rst b/doc/related.rst
new file mode 100644
index 0000000..62e2569
--- /dev/null
+++ b/doc/related.rst
@@ -0,0 +1,14 @@
+Related projects
+----------------
+
+brozzler_
+ Uses Google Chrome as well, but intercepts traffic using a proxy. Supports
+ distributed crawling and immediate playback.
+Squidwarc_
+ Communicates with headless Google Chrome and uses the Network API to
+ retrieve requests like crocoite. Supports recursive crawls and page
+ scrolling, but neither custom JavaScript nor distributed crawling.
+
+.. _brozzler: https://github.com/internetarchive/brozzler
+.. _Squidwarc: https://github.com/N0taN3rd/Squidwarc
+
diff --git a/doc/usage.rst b/doc/usage.rst
new file mode 100644
index 0000000..9049356
--- /dev/null
+++ b/doc/usage.rst
@@ -0,0 +1,15 @@
+Usage
+-----
+
+.. autofunction:: crocoite.cli.single
+
+Recursion
+^^^^^^^^^
+
+.. autofunction:: crocoite.cli.recursive
+
+IRC bot
+^^^^^^^
+
+.. autofunction:: crocoite.cli.irc
+