README: Add rationale

Explain a few design decisions
author: Lars-Dominik Braun <lars@6xq.net> 2018-08-19 11:50:23 +0200
committer: Lars-Dominik Braun <lars@6xq.net> 2018-08-19 11:50:23 +0200
commit: 8e5ac24c85ca9388410b2afda9a05fa4a3d9bf92 (patch)
tree: 8c4fbb79fb00ed5ecf77a667c287dcd62e50ff87 /README.rst
parent: 8b588adb9516afcaa6cd94172f856e31066baa2a (diff)
download: crocoite-8e5ac24c85ca9388410b2afda9a05fa4a3d9bf92.tar.gz
crocoite-8e5ac24c85ca9388410b2afda9a05fa4a3d9bf92.tar.bz2
crocoite-8e5ac24c85ca9388410b2afda9a05fa4a3d9bf92.zip
1 files changed, 87 insertions, 25 deletions
diff --git a/README.rst b/README.rst
index 08e14ba..7108491 100644
--- a/README.rst
+++ b/README.rst
@@ -8,24 +8,35 @@ Archive websites using `headless Google Chrome`_ and its DevTools protocol.
 
 .. _headless Google Chrome: https://developers.google.com/web/updates/2017/04/headless-chrome
 
-Dependencies
-------------
+Quick start
+-----------
+
+The following dependencies must be present to run crocoite:
 
 - Python 3
 - pychrome_ 
 - warcio_
 - html5lib_
-- Celery_
+- Celery_ (optional)
 
 .. _pychrome: https://github.com/fate0/pychrome
 .. _warcio: https://github.com/webrecorder/warcio
 .. _html5lib: https://github.com/html5lib/html5lib-python
 .. _Celery: http://www.celeryproject.org/
 
-Usage
------
+It is recommended to prepare a virtualenv and let pip handle the dependency
+resolution for Python packages instead:
+
+.. code:: bash
+
+    cd crocoite
+    virtualenv -p python3 sandbox
+    source sandbox/bin/activate
+    pip install .
+
+One-shot commandline interface and pywb_ playback:
 
-One-shot commandline interface and pywb_ playback::
+.. code:: bash
 
     crocoite-grab --output example.com.warc.gz http://example.com/
     rm -rf collections && wb-manager init test && wb-manager add test example.com.warc.gz
@@ -34,21 +45,65 @@ One-shot commandline interface and pywb_ playback::
 
 .. _pywb: https://github.com/ikreymer/pywb
 
-Behavior scripts
-^^^^^^^^^^^^^^^^
+Rationale
+---------
+
+Most modern websites depend heavily on executing code, usually JavaScript, on
+the user’s machine. They also make use of new and emerging Web technologies
+like HTML5, WebSockets, service workers and more. Even worse from the
+preservation point of view, they also require some form of user interaction to
+dynamically load more content (infinite scrolling, dynamic comment loading,
+etc).
+
+The naive approach of fetching an HTML page, parsing it and extracting
+links to referenced resources therefore is not sufficient to create a faithful
+snapshot of these web applications. A full browser, capable of running scripts and
+providing modern Web APIs, is absolutely required for this task. Thankfully
+Google Chrome runs without a display (headless mode) and can be controlled by
+external programs, allowing them to navigate and extract or inject data.
+This section describes the solutions crocoite offers and explains design
+decisions taken.
+
+crocoite captures resources by listening to Chrome’s `network events`_ and
+requesting the response body using `Network.getResponseBody`_. This approach
+has caveats: The original HTTP requests and responses, as sent over the wire,
+are not available. They are reconstructed from parsed data. The character
+encoding for text documents is changed to UTF-8. And the content body of HTTP
+redirects cannot be retrieved due to a race condition.
+
+.. _network events: https://chromedevtools.github.io/devtools-protocol/1-3/Network
+.. _Network.getResponseBody: https://chromedevtools.github.io/devtools-protocol/1-3/Network#method-getResponseBody
+
+But at the same time it allows crocoite to rely on Chrome’s well-tested network
+stack and HTTP parser. Thus it supports HTTP version 1 and 2 as well as
+transport protocols like SSL and QUIC. Depending on Chrome also eliminates the
+need for a man-in-the-middle proxy, like warcprox_, which has to decrypt SSL
+traffic and present a fake certificate to the browser in order to store the
+transmitted content.
+
+.. _warcprox: https://github.com/internetarchive/warcprox
+
+WARC records generated by crocoite therefore are an abstract view on the
+resource they represent and not necessarily the data sent over the wire. A URL
+fetched with HTTP/2 for example will still result in an HTTP/1.1
+request/response pair in the WARC file. This may be undesirable from
+an archivist’s point of view (“save the data exactly like we received it”). But
+this level of abstraction is inevitable when dealing with more than one
+protocol.
+
+crocoite also interacts with and therefore alters the grabbed websites. It does
+so by injecting `behavior scripts`_ into the site. Typically these are written
+in JavaScript, because interacting with a page is easier this way. These
+scripts then perform different tasks: Extracting targets from visible
+hyperlinks, clicking buttons or scrolling the website to load more content,
+as well as taking a static screenshot of ``<canvas>`` elements for the DOM
+snapshot (see below).
+
+.. _behavior scripts: https://github.com/PromyLOPh/crocoite/tree/master/crocoite/data
+
+Replaying archived WARCs can be quite challenging and might not be possible
+with current technology (or even at all):
 
-A lot of sites need some form of interaction to dynamically load more content. Twitter for
-instance continously loads new posts when scrolling to the bottom of the page.
-crocoite can emulate these user interactions (and more) by combining control
-code written in Python and injecting JavaScript into the page. The code can be
-limited to certain URLs or apply to every page loaded. By default all scripts
-available are enabled, see command line flag ``--behavior``.
-
-Caveats
--------
-
-- Original HTTP requests/responses are not available. They are rebuilt from
-  parsed data. Character encoding for text documents is changed to UTF-8.
 - Some sites request assets based on screen resolution, pixel ratio and
   supported image formats (webp). Replaying those with different parameters
   won’t work, since assets for those are missing. Example: missguided.com.
@@ -57,15 +112,22 @@ Caveats
   won’t work. Example: weather.com.
 - Range requests (Range: bytes=1-100) are captured as-is, making playback
   difficult
-- Content body of HTTP redirects cannot be retrived due to race condition
 
-Most of these issues can be worked around by using the DOM snapshot, which is
-also saved. This causes its own set of issues though:
+crocoite offers two methods to work around these issues. Firstly it can save a
+DOM snapshot to the WARC file. It contains the entire DOM in HTML format minus
+``<script>`` tags after the site has been fully loaded and thus can be
+displayed without executing scripts.  Obviously JavaScript-based navigation
+does not work any more. Secondly it also saves a screenshot of the full page,
+so even if future browsers cannot render and display the stored HTML a fully
+rendered version of the website can be replayed instead.
+
+Advanced usage
+--------------
 
-- JavaScript-based navigation does not work.
+crocoite offers more than just a one-shot command-line interface.
 
 Distributed crawling
---------------------
+^^^^^^^^^^^^^^^^^^^^
 
 Configure using celeryconfig.py
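
Note on the rationale added above: the new text says that records are "reconstructed from parsed data" and that a URL fetched with HTTP/2 still ends up as an HTTP/1.1 request/response pair. The following is a minimal sketch of what such a reconstruction could look like, using only the Python standard library. It is not crocoite's code; the function name ``rebuild_http11_response`` and the choice to drop ``Content-Encoding`` and ``Transfer-Encoding`` (on the assumption that the browser hands over an already decoded body) are hypothetical and made for illustration only.

```python
from http import HTTPStatus


def rebuild_http11_response(status, headers, body):
    """Serialize parsed response data as an HTTP/1.1 message (bytes).

    Hypothetical helper for illustration; not part of crocoite.
    """
    try:
        reason = HTTPStatus(status).phrase
    except ValueError:
        # Non-standard status code: keep an empty reason phrase.
        reason = ""
    lines = [f"HTTP/1.1 {status} {reason}"]
    for name, value in headers.items():
        # HTTP/2 pseudo-headers such as ":status" have no HTTP/1.1 form.
        if name.startswith(":"):
            continue
        # Assumption: the body is already decoded, so the framing and
        # encoding headers of the original transfer no longer apply.
        if name.lower() in ("content-length", "transfer-encoding",
                            "content-encoding"):
            continue
        lines.append(f"{name}: {value}")
    # Recompute the length from the body that is actually stored.
    lines.append(f"Content-Length: {len(body)}")
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("iso-8859-1") + body


if __name__ == "__main__":
    parsed_headers = {
        ":status": "200",
        "content-type": "text/html; charset=utf-8",
        "content-length": "999",
    }
    print(rebuild_http11_response(200, parsed_headers, b"<p>hi</p>"))
```

The result is a well-formed HTTP/1.1 message, but it is a rebuilt view of the resource rather than the bytes sent over the wire, which is the trade-off the rationale describes.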