authorLars-Dominik Braun <lars@6xq.net>2018-08-19 11:50:23 +0200
committerLars-Dominik Braun <lars@6xq.net>2018-08-19 11:50:23 +0200
commit8e5ac24c85ca9388410b2afda9a05fa4a3d9bf92 (patch)
tree8c4fbb79fb00ed5ecf77a667c287dcd62e50ff87
parent8b588adb9516afcaa6cd94172f856e31066baa2a (diff)
downloadcrocoite-8e5ac24c85ca9388410b2afda9a05fa4a3d9bf92.tar.gz
crocoite-8e5ac24c85ca9388410b2afda9a05fa4a3d9bf92.tar.bz2
crocoite-8e5ac24c85ca9388410b2afda9a05fa4a3d9bf92.zip
README: Add rationale
Explain a few design decisions
-rw-r--r--  README.rst  112
1 file changed, 87 insertions, 25 deletions
diff --git a/README.rst b/README.rst
index 08e14ba..7108491 100644
--- a/README.rst
+++ b/README.rst
@@ -8,24 +8,35 @@ Archive websites using `headless Google Chrome`_ and its DevTools protocol.
.. _headless Google Chrome: https://developers.google.com/web/updates/2017/04/headless-chrome
-Dependencies
-------------
+Quick start
+-----------
+
+The following dependencies must be present to run crocoite:
- Python 3
- pychrome_
- warcio_
- html5lib_
-- Celery_
+- Celery_ (optional)
.. _pychrome: https://github.com/fate0/pychrome
.. _warcio: https://github.com/webrecorder/warcio
.. _html5lib: https://github.com/html5lib/html5lib-python
.. _Celery: http://www.celeryproject.org/
-Usage
------
+It is recommended to prepare a virtualenv and let pip handle the dependency
+resolution for Python packages instead:
+
+.. code:: bash
+
+ cd crocoite
+ virtualenv -p python3 sandbox
+ source sandbox/bin/activate
+ pip install .
+
+One-shot command-line interface and pywb_ playback:
-One-shot commandline interface and pywb_ playback::
+.. code:: bash
+
crocoite-grab --output example.com.warc.gz http://example.com/
rm -rf collections && wb-manager init test && wb-manager add test example.com.warc.gz
@@ -34,21 +45,65 @@ One-shot commandline interface and pywb_ playback::
.. _pywb: https://github.com/ikreymer/pywb
-Behavior scripts
-^^^^^^^^^^^^^^^^
+Rationale
+---------
+
+Most modern websites depend heavily on executing code, usually JavaScript, on
+the user’s machine. They also make use of new and emerging Web technologies
+like HTML5, WebSockets, service workers and more. Even worse from a
+preservation point of view, they require some form of user interaction to
+dynamically load more content (infinite scrolling, dynamic comment loading,
+etc.).
+
+The naive approach of fetching an HTML page, parsing it and extracting links
+to referenced resources is therefore not sufficient to create a faithful
+snapshot of these web applications. A full browser, capable of running scripts
+and providing modern Web APIs, is required for this task. Thankfully,
+Google Chrome runs without a display (headless mode) and can be controlled by
+external programs, allowing them to navigate and extract or inject data.
+This section describes the solutions crocoite offers and explains the design
+decisions taken.
+
+crocoite captures resources by listening to Chrome’s `network events`_ and
+requesting the response body using `Network.getResponseBody`_. This approach
+has caveats: the original HTTP requests and responses, as sent over the wire,
+are not available; they are reconstructed from parsed data. The character
+encoding of text documents is changed to UTF-8, and the content body of HTTP
+redirects cannot be retrieved due to a race condition.
+
+.. _network events: https://chromedevtools.github.io/devtools-protocol/1-3/Network
+.. _Network.getResponseBody: https://chromedevtools.github.io/devtools-protocol/1-3/Network#method-getResponseBody
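
The capture loop can be sketched with the pychrome_ client. This is an
illustration only, not crocoite's actual code: the debugger URL, timeout and
wait interval are made-up values, and Chrome must already be running with
``--headless --remote-debugging-port=9222``.

```python
captured = {}

def on_response(requestId, response, **kwargs):
    # Network.responseReceived fires once per resource; remember the request
    # id so the body can be fetched afterwards.
    captured[requestId] = response['url']

def grab(url):
    # Deferred import so the sketch can be read without pychrome installed.
    import pychrome
    browser = pychrome.Browser(url='http://127.0.0.1:9222')
    tab = browser.new_tab()
    tab.set_listener('Network.responseReceived', on_response)
    tab.start()
    tab.Network.enable()
    tab.Page.navigate(url=url, _timeout=10)
    tab.wait(5)  # crude: give subresources time to load
    bodies = {}
    for rid, resource_url in captured.items():
        # Bodies are only available while the tab is alive; redirect bodies
        # are usually gone by this point (the race condition noted above).
        result = tab.Network.getResponseBody(requestId=rid)
        bodies[resource_url] = result['body']
    tab.stop()
    browser.close_tab(tab)
    return bodies
```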
+
+At the same time this allows crocoite to rely on Chrome’s well-tested network
+stack and HTTP parser. It thus supports HTTP versions 1 and 2 as well as
+TLS and QUIC. Depending on Chrome also eliminates the need for a
+man-in-the-middle proxy, like warcprox_, which has to decrypt TLS traffic and
+present a fake certificate to the browser in order to store the transmitted
+content.
+
+.. _warcprox: https://github.com/internetarchive/warcprox
+
+WARC records generated by crocoite are therefore an abstract view of the
+resource they represent and not necessarily the data sent over the wire. A URL
+fetched with HTTP/2, for example, will still result in an HTTP/1.1
+request/response pair in the WARC file. This may be undesirable from
+an archivist’s point of view (“save the data exactly as we received it”), but
+this level of abstraction is inevitable when dealing with more than one
+protocol.
+
+crocoite also interacts with, and therefore alters, the grabbed websites. It
+does so by injecting `behavior scripts`_ into the site. Typically these are
+written in JavaScript, because interacting with a page is easier this way.
+These scripts perform different tasks: extracting targets from visible
+hyperlinks, clicking buttons or scrolling the website to load more content,
+as well as taking a static screenshot of ``<canvas>`` elements for the DOM
+snapshot (see below).
+
+.. _behavior scripts: https://github.com/PromyLOPh/crocoite/tree/master/crocoite/data
+
+Replaying archived WARCs can be quite challenging and might not be possible
+with current technology (or even at all):
-A lot of sites need some form of interaction to dynamically load more content. Twitter for
-instance continously loads new posts when scrolling to the bottom of the page.
-crocoite can emulate these user interactions (and more) by combining control
-code written in Python and injecting JavaScript into the page. The code can be
-limited to certain URLs or apply to every page loaded. By default all scripts
-available are enabled, see command line flag ``--behavior``.
-
-Caveats
--------
-
-- Original HTTP requests/responses are not available. They are rebuilt from
- parsed data. Character encoding for text documents is changed to UTF-8.
- Some sites request assets based on screen resolution, pixel ratio and
supported image formats (webp). Replaying those with different parameters
won’t work, since assets for those are missing. Example: missguided.com.
@@ -57,15 +112,22 @@ Caveats
won’t work. Example: weather.com.
- Range requests (Range: bytes=1-100) are captured as-is, making playback
difficult
-- Content body of HTTP redirects cannot be retrived due to race condition
-Most of these issues can be worked around by using the DOM snapshot, which is
-also saved. This causes its own set of issues though:
+crocoite offers two methods to work around these issues. Firstly, it can save
+a DOM snapshot to the WARC file. It contains the entire DOM in HTML format,
+minus ``<script>`` tags, after the site has been fully loaded, and can thus be
+displayed without executing scripts. Obviously JavaScript-based navigation
+does not work any more. Secondly, it also saves a screenshot of the full page,
+so even if future browsers cannot render and display the stored HTML, a fully
+rendered version of the website can be replayed instead.
+
+Advanced usage
+--------------
-- JavaScript-based navigation does not work.
+crocoite offers more than just a one-shot command-line interface.
Distributed crawling
---------------------
+^^^^^^^^^^^^^^^^^^^^
Configure using celeryconfig.py
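
A minimal ``celeryconfig.py`` might look like the following. The broker URL
and serializer settings here are illustrative examples, not crocoite
defaults; use whichever broker your Celery deployment actually runs.

```python
# celeryconfig.py -- illustrative values only
broker_url = 'redis://localhost:6379/0'
result_backend = 'redis://localhost:6379/0'
task_serializer = 'json'
result_serializer = 'json'
accept_content = ['json']
```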