From cb1d9e40ce99fd6c5d045e13e10619c8a24f12e8 Mon Sep 17 00:00:00 2001
From: Lars-Dominik Braun
Date: Fri, 22 Mar 2019 12:25:22 +0100
Subject: Move documentation to Sphinx

---
 README.rst | 216 +------------------------------------------------------------
 1 file changed, 1 insertion(+), 215 deletions(-)
 mode change 100644 => 120000 README.rst

(limited to 'README.rst')

diff --git a/README.rst b/README.rst
deleted file mode 100644
index 581ac13..0000000
--- a/README.rst
+++ /dev/null
@@ -1,215 +0,0 @@
-crocoite
-========
-
-Preservation for the modern web, powered by `headless Google
-Chrome`_.
-
-.. image:: https://travis-ci.org/PromyLOPh/crocoite.svg?branch=master
-    :target: https://travis-ci.org/PromyLOPh/crocoite
-
-.. image:: https://codecov.io/gh/PromyLOPh/crocoite/branch/master/graph/badge.svg
-    :target: https://codecov.io/gh/PromyLOPh/crocoite
-
-.. _headless Google Chrome: https://developers.google.com/web/updates/2017/04/headless-chrome
-
-Quick start
------------
-
-These dependencies must be present to run crocoite:
-
-- Python ≥3.6
-- PyYAML_
-- aiohttp_
-- websockets_
-- warcio_
-- html5lib_
-- yarl_
-- multidict_
-- bottom_ (IRC client)
-- `Google Chrome`_
-
-.. _PyYAML: https://pyyaml.org/wiki/PyYAML
-.. _aiohttp: https://aiohttp.readthedocs.io/
-.. _websockets: https://websockets.readthedocs.io/
-.. _warcio: https://github.com/webrecorder/warcio
-.. _html5lib: https://github.com/html5lib/html5lib-python
-.. _bottom: https://github.com/numberoverzero/bottom
-.. _Google Chrome: https://www.google.com/chrome/
-.. _yarl: https://yarl.readthedocs.io/
-.. _multidict: https://multidict.readthedocs.io/
-
-The following commands clone the repository from GitHub_, set up a virtual
-environment and install crocoite:
-
-.. _GitHub: https://github.com/PromyLOPh/crocoite
-
-.. code:: bash
-
-    git clone https://github.com/PromyLOPh/crocoite.git
-    cd crocoite
-    virtualenv -p python3 sandbox
-    source sandbox/bin/activate
-    pip install .
-
-One-shot command line interface and pywb_ playback:
-
-.. code:: bash
-
-    pip install pywb
-    crocoite-grab http://example.com/ example.com.warc.gz
-    rm -rf collections && wb-manager init test && wb-manager add test example.com.warc.gz
-    wayback &
-    $BROWSER http://localhost:8080
-
-.. _pywb: https://github.com/ikreymer/pywb
-
-Rationale
----------
-
-Most modern websites depend heavily on executing code, usually JavaScript, on
-the user’s machine. They also make use of new and emerging Web technologies
-like HTML5, WebSockets, service workers and more. Even worse from the
-preservation point of view, they also require some form of user interaction to
-dynamically load more content (infinite scrolling, dynamic comment loading,
-etc).
-
-The naive approach of fetching an HTML page, parsing it and extracting
-links to referenced resources therefore is not sufficient to create a faithful
-snapshot of these web applications. A full browser, capable of running scripts and
-providing modern Web APIs, is absolutely required for this task. Thankfully
-Google Chrome runs without a display (headless mode) and can be controlled by
-external programs, allowing them to navigate and extract or inject data.
-This section describes the solutions crocoite offers and explains design
-decisions taken.
-
-crocoite captures resources by listening to Chrome’s `network events`_ and
-requesting the response body using `Network.getResponseBody`_. This approach
-has caveats: The original HTTP requests and responses, as sent over the wire,
-are not available. They are reconstructed from parsed data. The character
-encoding for text documents is changed to UTF-8. And the content body of HTTP
-redirects cannot be retrieved due to a race condition.
-
-.. _network events: https://chromedevtools.github.io/devtools-protocol/1-3/Network
-..
_Network.getResponseBody: https://chromedevtools.github.io/devtools-protocol/1-3/Network#method-getResponseBody
-
-But at the same time it allows crocoite to rely on Chrome’s well-tested network
-stack and HTTP parser. Thus it supports HTTP version 1 and 2 as well as
-transport protocols like SSL and QUIC. Depending on Chrome also eliminates the
-need for a man-in-the-middle proxy, like warcprox_, which has to decrypt SSL
-traffic and present a fake certificate to the browser in order to store the
-transmitted content.
-
-.. _warcprox: https://github.com/internetarchive/warcprox
-
-WARC records generated by crocoite therefore are an abstract view on the
-resource they represent and not necessarily the data sent over the wire. A URL
-fetched with HTTP/2 for example will still result in an HTTP/1.1
-request/response pair in the WARC file. This may be undesirable from
-an archivist’s point of view (“save the data exactly like we received it”). But
-this level of abstraction is inevitable when dealing with more than one
-protocol.
-
-crocoite also interacts with and therefore alters the grabbed websites. It does
-so by injecting `behavior scripts`_ into the site. Typically these are written
-in JavaScript, because interacting with a page is easier this way. These
-scripts then perform different tasks: Extracting targets from visible
-hyperlinks, clicking buttons or scrolling the website to load more content,
-as well as taking a static screenshot of ``<canvas>`` elements for the DOM
-snapshot (see below).
-
-.. _behavior scripts: https://github.com/PromyLOPh/crocoite/tree/master/crocoite/data
-
-Replaying archived WARCs can be quite challenging and might not be possible
-with current technology (or even at all):
-
-- Some sites request assets based on screen resolution, pixel ratio and
-  supported image formats (webp). Replaying those with different parameters
-  won’t work, since assets for those are missing. Example: missguided.com.
-- Some fetch different scripts based on user agent. Example: youtube.com.
-- Requests containing randomly generated JavaScript callback function names
-  won’t work. Example: weather.com.
-- Range requests (Range: bytes=1-100) are captured as-is, making playback
-  difficult
-
-crocoite offers two methods to work around these issues. Firstly it can save a
-DOM snapshot to the WARC file. It contains the entire DOM in HTML format minus
-``