crocoite
========

Preservation for the modern web, powered by `headless Google
Chrome`_.

.. image:: https://travis-ci.org/PromyLOPh/crocoite.svg?branch=master
   :target: https://travis-ci.org/PromyLOPh/crocoite

.. image:: https://codecov.io/gh/PromyLOPh/crocoite/branch/master/graph/badge.svg
   :target: https://codecov.io/gh/PromyLOPh/crocoite

.. _headless Google Chrome: https://developers.google.com/web/updates/2017/04/headless-chrome

Quick start
-----------

These dependencies must be present to run crocoite:

- Python ≥3.6
- PyYAML_
- aiohttp_
- websockets_
- warcio_
- html5lib_
- yarl_
- multidict_
- bottom_ (IRC client)
- `Google Chrome`_

.. _PyYAML: https://pyyaml.org/wiki/PyYAML
.. _aiohttp: https://aiohttp.readthedocs.io/
.. _websockets: https://websockets.readthedocs.io/
.. _warcio: https://github.com/webrecorder/warcio
.. _html5lib: https://github.com/html5lib/html5lib-python
.. _bottom: https://github.com/numberoverzero/bottom
.. _Google Chrome: https://www.google.com/chrome/
.. _yarl: https://yarl.readthedocs.io/
.. _multidict: https://multidict.readthedocs.io/

The following commands clone the repository from GitHub_, set up a virtual
environment and install crocoite:

.. _GitHub: https://github.com/PromyLOPh/crocoite

.. code:: bash

    git clone https://github.com/PromyLOPh/crocoite.git
    cd crocoite
    virtualenv -p python3 sandbox
    source sandbox/bin/activate
    pip install .

One-shot command line interface and pywb_ playback:

.. code:: bash

    pip install pywb
    crocoite-grab http://example.com/ example.com.warc.gz
    rm -rf collections && wb-manager init test && wb-manager add test example.com.warc.gz
    wayback &
    $BROWSER http://localhost:8080

.. _pywb: https://github.com/ikreymer/pywb

Rationale
---------

Most modern websites depend heavily on executing code, usually JavaScript, on
the user’s machine. They also make use of new and emerging Web technologies
like HTML5, WebSockets, service workers and more. Even worse from the
preservation point of view, they also require some form of user interaction to
dynamically load more content (infinite scrolling, dynamic comment loading,
etc.).

The naive approach of fetching an HTML page, parsing it and extracting links
to referenced resources is therefore not sufficient to create a faithful
snapshot of these web applications. A full browser, capable of running scripts
and providing modern Web APIs, is absolutely required for this task.
Thankfully Google Chrome runs without a display (headless mode) and can be
controlled by external programs, allowing them to navigate and extract or
inject data. This section describes the solutions crocoite offers and explains
the design decisions taken.

crocoite captures resources by listening to Chrome’s `network events`_ and
requesting the response body using `Network.getResponseBody`_. This approach
has caveats: the original HTTP requests and responses, as sent over the wire,
are not available; they are reconstructed from parsed data. The character
encoding for text documents is changed to UTF-8. And the content body of HTTP
redirects cannot be retrieved due to a race condition.

.. _network events: https://chromedevtools.github.io/devtools-protocol/1-3/Network
.. _Network.getResponseBody: https://chromedevtools.github.io/devtools-protocol/1-3/Network#method-getResponseBody
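
The following is a minimal, hypothetical sketch of this capture approach, not
crocoite’s actual implementation. It assumes Chrome was started with
``--remote-debugging-port=9222``, that the first target listed on the
debugging endpoint is the tab to use, and it only relies on aiohttp_ and
websockets_ from the dependency list above: it navigates the tab, waits for
``Network.loadingFinished`` events and fetches each body with
``Network.getResponseBody``. Error handling and WARC writing are omitted.

.. code:: python

    import asyncio, json

    import aiohttp
    import websockets

    async def capture(url='http://example.com/', wait=5):
        # Find the DevTools websocket URL of the first open tab.
        async with aiohttp.ClientSession() as http:
            async with http.get('http://localhost:9222/json') as resp:
                tab = (await resp.json())[0]

        async with websockets.connect(tab['webSocketDebuggerUrl'], max_size=None) as ws:
            msgid = 0
            async def send(method, **params):
                # Fire off a protocol command; the reply arrives in the loop below.
                nonlocal msgid
                msgid += 1
                await ws.send(json.dumps({'id': msgid, 'method': method, 'params': params}))
                return msgid

            await send('Network.enable')
            await send('Page.navigate', url=url)

            bodies = {}   # requestId -> response body
            pending = {}  # command id -> requestId of a getResponseBody call
            deadline = asyncio.get_event_loop().time() + wait
            while asyncio.get_event_loop().time() < deadline:
                try:
                    msg = json.loads(await asyncio.wait_for(ws.recv(), timeout=1))
                except asyncio.TimeoutError:
                    continue
                if msg.get('method') == 'Network.loadingFinished':
                    # The body can be requested once the resource finished loading.
                    reqid = msg['params']['requestId']
                    pending[await send('Network.getResponseBody', requestId=reqid)] = reqid
                elif msg.get('id') in pending:
                    # The result also carries a base64Encoded flag for binary bodies.
                    bodies[pending.pop(msg['id'])] = msg.get('result', {}).get('body', '')
            return bodies

    print(asyncio.get_event_loop().run_until_complete(capture()))
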
At the same time this approach allows crocoite to rely on Chrome’s well-tested
network stack and HTTP parser. Thus it supports HTTP versions 1 and 2 as well
as transport protocols like SSL and QUIC. Depending on Chrome also eliminates
the need for a man-in-the-middle proxy, like warcprox_, which has to decrypt
SSL traffic and present a fake certificate to the browser in order to store
the transmitted content.

.. _warcprox: https://github.com/internetarchive/warcprox

WARC records generated by crocoite are therefore an abstract view of the
resource they represent and not necessarily the data sent over the wire. A URL
fetched with HTTP/2, for example, will still result in an HTTP/1.1
request/response pair in the WARC file. This may be undesirable from an
archivist’s point of view (“save the data exactly like we received it”), but
this level of abstraction is inevitable when dealing with more than one
protocol.

crocoite also interacts with, and therefore alters, the grabbed websites. It
does so by injecting `behavior scripts`_ into the site. Typically these are
written in JavaScript, because interacting with a page is easier this way.
These scripts then perform different tasks: extracting targets from visible
hyperlinks, clicking buttons or scrolling the website to load more content, as
well as taking a static screenshot of ``<canvas>`` elements for the DOM
snapshot (see below).

.. _behavior scripts: https://github.com/PromyLOPh/crocoite/tree/master/crocoite/data
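
To illustrate the idea (this is not one of crocoite’s bundled behavior
scripts), the hypothetical sketch below evaluates a small piece of JavaScript
in the page via ``Runtime.evaluate``: it scrolls to the bottom, which triggers
lazy loading on many sites, and returns the target of every visible hyperlink.
It assumes the same locally running, remote-debugging-enabled Chrome instance
as the sketch above.

.. code:: python

    import asyncio, json

    import aiohttp
    import websockets

    # JavaScript injected into the page: scroll to the bottom and return the
    # target of every hyperlink currently in the document.
    BEHAVIOR = """
    (function () {
        window.scrollTo(0, document.body.scrollHeight);
        return Array.from(document.links, l => l.href);
    })()
    """

    async def run_behavior():
        # Pick the first open tab on the local debugging endpoint.
        async with aiohttp.ClientSession() as http:
            async with http.get('http://localhost:9222/json') as resp:
                tab = (await resp.json())[0]
        async with websockets.connect(tab['webSocketDebuggerUrl'], max_size=None) as ws:
            await ws.send(json.dumps({'id': 1, 'method': 'Runtime.evaluate',
                    'params': {'expression': BEHAVIOR, 'returnByValue': True}}))
            while True:
                msg = json.loads(await ws.recv())
                if msg.get('id') == 1:
                    # returnByValue=True yields the plain JSON value of the expression.
                    return msg['result']['result']['value']

    print(asyncio.get_event_loop().run_until_complete(run_behavior()))
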
Replaying archived WARCs can be quite challenging and might not be possible
with current technology (or even at all):

- Some sites request assets based on screen resolution, pixel ratio and
  supported image formats (WebP). Replaying those with different parameters
  won’t work, since assets for them are missing. Example: missguided.com.
- Some fetch different scripts based on the user agent. Example: youtube.com.
- Requests containing randomly generated JavaScript callback function names
  won’t work. Example: weather.com.
- Range requests (``Range: bytes=1-100``) are captured as-is, making playback
  difficult.

crocoite offers two methods to work around these issues. Firstly it can save a
DOM snapshot to the WARC file. It contains the entire DOM in HTML format,
minus ``<script>`` tags, after the site has been fully loaded and thus can be
displayed without executing scripts. Obviously JavaScript-based navigation
does not work any more. Secondly it also saves a screenshot of the full page,
so even if future browsers cannot render and display the stored HTML, a fully
rendered version of the website can be replayed instead.
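
The hypothetical sketch below illustrates how such artifacts can be retrieved
over the DevTools protocol; it is not crocoite’s implementation. It serializes
the current DOM with ``DOM.getOuterHTML`` and captures a viewport screenshot
with ``Page.captureScreenshot``, again assuming a Chrome instance with remote
debugging enabled on port 9222. crocoite’s own snapshot additionally strips
``<script>`` tags and covers the full page rather than just the viewport.

.. code:: python

    import asyncio, base64, json

    import aiohttp
    import websockets

    async def snapshot(prefix='example'):
        # Pick the first open tab on the local debugging endpoint.
        async with aiohttp.ClientSession() as http:
            async with http.get('http://localhost:9222/json') as resp:
                tab = (await resp.json())[0]

        async with websockets.connect(tab['webSocketDebuggerUrl'], max_size=None) as ws:
            msgid = 0
            async def call(method, **params):
                # Send a command and wait for its result, skipping unrelated events.
                nonlocal msgid
                msgid += 1
                await ws.send(json.dumps({'id': msgid, 'method': method, 'params': params}))
                while True:
                    msg = json.loads(await ws.recv())
                    if msg.get('id') == msgid:
                        return msg.get('result', {})

            # Serialize the DOM of the currently loaded page to HTML.
            doc = await call('DOM.getDocument')
            html = (await call('DOM.getOuterHTML', nodeId=doc['root']['nodeId']))['outerHTML']
            with open(prefix + '.html', 'w') as fd:
                fd.write(html)

            # Screenshot of the current viewport, returned base64-encoded.
            shot = await call('Page.captureScreenshot', format='png')
            with open(prefix + '.png', 'wb') as fd:
                fd.write(base64.b64decode(shot['data']))

    asyncio.get_event_loop().run_until_complete(snapshot())
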
Advanced usage
--------------

crocoite is built with the Unix philosophy (“do one thing and do it well”) in
mind. Thus ``crocoite-grab`` can only save a single page. If you want
recursion use ``crocoite-recursive``, which follows hyperlinks according to
``--policy``. It can either recurse a maximum number of levels or grab all
pages with the same prefix as the start URL:

.. code:: bash

    crocoite-recursive --policy prefix http://www.example.com/dir/ output

will save all pages in ``/dir/`` and below to individual files in the output
directory ``output``. You can customize the command used to grab individual
pages by appending it after ``output``. This way distributed grabs (ssh to a
different machine and execute the job there, queue the command with Slurm, …)
are possible.

IRC bot
^^^^^^^

A simple IRC bot (“chromebot”) is provided with the command ``crocoite-irc``.
It reads its configuration from a config file like the example provided in
``contrib/chromebot.json`` and supports the following commands:

a <url> -j <concurrency> -r <policy>
    Archive <url> with <concurrency> processes according to recursion <policy>
s <uuid>
    Get job status for <uuid>
r <uuid>
    Revoke or abort running job with <uuid>

Browser configuration
^^^^^^^^^^^^^^^^^^^^^

Generally crocoite provides reasonable defaults for Google Chrome via its
`devtools module`_. When debugging this software it might be necessary to open
a non-headless instance of the browser by running

.. code:: bash

    google-chrome-stable --remote-debugging-port=9222 --auto-open-devtools-for-tabs

and then passing the option ``--browser=http://localhost:9222`` to
``crocoite-grab``. This allows human intervention through the browser’s
built-in console.

Another issue that might arise is related to fonts. Headless servers usually
don’t have them installed by default and thus rendered screenshots may contain
replacement characters (□) instead of the actual text. This mostly affects
non-Latin character sets. It is therefore recommended to install at least
Microsoft’s Corefonts_ as well as DejaVu_, Liberation_ or a similar font
family covering a wide range of character sets.

.. _devtools module: crocoite/devtools.py
.. _Corefonts: http://corefonts.sourceforge.net/
.. _DejaVu: https://dejavu-fonts.github.io/
.. _Liberation: https://pagure.io/liberation-fonts

Related projects
----------------

brozzler_
    Uses Google Chrome as well, but intercepts traffic using a proxy. Supports
    distributed crawling and immediate playback.
Squidwarc_
    Communicates with headless Google Chrome and uses the Network API to
    retrieve requests like crocoite. Supports recursive crawls and page
    scrolling, but neither custom JavaScript nor distributed crawling.

.. _brozzler: https://github.com/internetarchive/brozzler
.. _Squidwarc: https://github.com/N0taN3rd/Squidwarc