-rw-r--r-- | README.rst | 112 |
1 files changed, 87 insertions, 25 deletions
@@ -8,24 +8,35 @@ Archive websites using `headless Google Chrome`_ and its DevTools protocol.
 
 .. _headless Google Chrome: https://developers.google.com/web/updates/2017/04/headless-chrome
 
-Dependencies
-------------
+Quick start
+-----------
+
+The following dependencies must be present to run crocoite:
 
 - Python 3
 - pychrome_
 - warcio_
 - html5lib_
-- Celery_
+- Celery_ (optional)
 
 .. _pychrome: https://github.com/fate0/pychrome
 .. _warcio: https://github.com/webrecorder/warcio
 .. _html5lib: https://github.com/html5lib/html5lib-python
 .. _Celery: http://www.celeryproject.org/
 
-Usage
------
+It is recommended to prepare a virtualenv and let pip handle the dependency
+resolution for Python packages instead:
+
+.. code:: bash
+
+    cd crocoite
+    virtualenv -p python3 sandbox
+    source sandbox/bin/activate
+    pip install .
+
+One-shot commandline interface and pywb_ playback:
 
-One-shot commandline interface and pywb_ playback::
+.. code:: bash
 
     crocoite-grab --output example.com.warc.gz http://example.com/
     rm -rf collections && wb-manager init test && wb-manager add test example.com.warc.gz
@@ -34,21 +45,65 @@ One-shot commandline interface and pywb_ playback::
 
 .. _pywb: https://github.com/ikreymer/pywb
 
-Behavior scripts
-^^^^^^^^^^^^^^^^
+Rationale
+---------
+
+Most modern websites depend heavily on executing code, usually JavaScript, on
+the user’s machine. They also make use of new and emerging Web technologies
+like HTML5, WebSockets, service workers and more. Even worse from the
+preservation point of view, they also require some form of user interaction to
+dynamically load more content (infinite scrolling, dynamic comment loading,
+etc).
+
+The naive approach of fetching an HTML page, parsing it and extracting
+links to referenced resources therefore is not sufficient to create a faithful
+snapshot of these web applications. A full browser, capable of running scripts and
+providing modern Web APIs is absolutely required for this task. Thankfully
+Google Chrome runs without a display (headless mode) and can be controlled by
+external programs, allowing them to navigate and extract or inject data.
+This section describes the solutions crocoite offers and explains design
+decisions taken.
+
+crocoite captures resources by listening to Chrome’s `network events`_ and
+requesting the response body using `Network.getResponseBody`_. This approach
+has caveats: The original HTTP requests and responses, as sent over the wire,
+are not available. They are reconstructed from parsed data. The character
+encoding for text documents is changed to UTF-8. And the content body of HTTP
+redirects cannot be retrieved due to a race condition.
+
+.. _network events: https://chromedevtools.github.io/devtools-protocol/1-3/Network
+.. _Network.getResponseBody: https://chromedevtools.github.io/devtools-protocol/1-3/Network#method-getResponseBody
+
+But at the same time it allows crocoite to rely on Chrome’s well-tested network
+stack and HTTP parser. Thus it supports HTTP version 1 and 2 as well as
+transport protocols like SSL and QUIC. Depending on Chrome also eliminates the
+need for a man-in-the-middle proxy, like warcprox_, which has to decrypt SSL
+traffic and present a fake certificate to the browser in order to store the
+transmitted content.
+
+.. _warcprox: https://github.com/internetarchive/warcprox
+
+WARC records generated by crocoite therefore are an abstract view on the
+resource they represent and not necessarily the data sent over the wire. A URL
+fetched with HTTP/2 for example will still result in an HTTP/1.1
+request/response pair in the WARC file. This may be undesirable from
+an archivist’s point of view (“save the data exactly like we received it”). But
+this level of abstraction is inevitable when dealing with more than one
+protocol.
+
+crocoite also interacts with and therefore alters the grabbed websites. It does
+so by injecting `behavior scripts`_ into the site. Typically these are written
+in JavaScript, because interacting with a page is easier this way. These
+scripts then perform different tasks: Extracting targets from visible
+hyperlinks, clicking buttons or scrolling the website to load more content,
+as well as taking a static screenshot of ``<canvas>`` elements for the DOM
+snapshot (see below).
+
+.. _behavior scripts: https://github.com/PromyLOPh/crocoite/tree/master/crocoite/data
+
+Replaying archived WARCs can be quite challenging and might not be possible
+with current technology (or even at all):
 
-A lot of sites need some form of interaction to dynamically load more content. Twitter for
-instance continuously loads new posts when scrolling to the bottom of the page.
-crocoite can emulate these user interactions (and more) by combining control
-code written in Python and injecting JavaScript into the page. The code can be
-limited to certain URLs or apply to every page loaded. By default all scripts
-available are enabled, see command line flag ``--behavior``.
-
-Caveats
--------
-
-- Original HTTP requests/responses are not available. They are rebuilt from
-  parsed data. Character encoding for text documents is changed to UTF-8.
 - Some sites request assets based on screen resolution, pixel ratio and
   supported image formats (webp). Replaying those with different parameters
   won’t work, since assets for those are missing. Example: missguided.com.
@@ -57,15 +112,22 @@ Caveats
   won’t work. Example: weather.com.
 - Range requests (Range: bytes=1-100) are captured as-is, making playback
   difficult
-- Content body of HTTP redirects cannot be retrieved due to race condition
 
-Most of these issues can be worked around by using the DOM snapshot, which is
-also saved. This causes its own set of issues though:
+crocoite offers two methods to work around these issues. Firstly it can save a
+DOM snapshot to the WARC file. It contains the entire DOM in HTML format minus
+``<script>`` tags after the site has been fully loaded and thus can be
+displayed without executing scripts. Obviously JavaScript-based navigation
+does not work any more. Secondly it also saves a screenshot of the full page,
+so even if future browsers cannot render and display the stored HTML a fully
+rendered version of the website can be replayed instead.
+
+Advanced usage
+--------------
 
-- JavaScript-based navigation does not work.
+crocoite offers more than just a one-shot command-line interface.
 
 Distributed crawling
---------------------
+^^^^^^^^^^^^^^^^^^^^
 
 Configure using celeryconfig.py