Diffstat (limited to 'README.rst')
l--------- [-rw-r--r--]  README.rst | 216
1 file changed, 1 insertion(+), 215 deletions(-)
diff --git a/README.rst b/README.rst
index 581ac13..176d9c2 100644..120000
--- a/README.rst
+++ b/README.rst
@@ -1,215 +1 @@
-crocoite
-========
-
-Preservation for the modern web, powered by `headless Google
-Chrome`_.
-
-.. image:: https://travis-ci.org/PromyLOPh/crocoite.svg?branch=master
- :target: https://travis-ci.org/PromyLOPh/crocoite
-
-.. image:: https://codecov.io/gh/PromyLOPh/crocoite/branch/master/graph/badge.svg
- :target: https://codecov.io/gh/PromyLOPh/crocoite
-
-.. _headless Google Chrome: https://developers.google.com/web/updates/2017/04/headless-chrome
-
-Quick start
------------
-
-These dependencies must be present to run crocoite:
-
-- Python ≥3.6
-- PyYAML_
-- aiohttp_
-- websockets_
-- warcio_
-- html5lib_
-- yarl_
-- multidict_
-- bottom_ (IRC client)
-- `Google Chrome`_
-
-.. _PyYAML: https://pyyaml.org/wiki/PyYAML
-.. _aiohttp: https://aiohttp.readthedocs.io/
-.. _websockets: https://websockets.readthedocs.io/
-.. _warcio: https://github.com/webrecorder/warcio
-.. _html5lib: https://github.com/html5lib/html5lib-python
-.. _bottom: https://github.com/numberoverzero/bottom
-.. _Google Chrome: https://www.google.com/chrome/
-.. _yarl: https://yarl.readthedocs.io/
-.. _multidict: https://multidict.readthedocs.io/
-
-The following commands clone the repository from GitHub_, set up a virtual
-environment and install crocoite:
-
-.. _GitHub: https://github.com/PromyLOPh/crocoite
-
-.. code:: bash
-
- git clone https://github.com/PromyLOPh/crocoite.git
- cd crocoite
- virtualenv -p python3 sandbox
- source sandbox/bin/activate
- pip install .
-
-One-shot command line interface and pywb_ playback:
-
-.. code:: bash
-
- pip install pywb
- crocoite-grab http://example.com/ example.com.warc.gz
- rm -rf collections && wb-manager init test && wb-manager add test example.com.warc.gz
- wayback &
- $BROWSER http://localhost:8080
-
-.. _pywb: https://github.com/ikreymer/pywb
-
-Rationale
----------
-
-Most modern websites depend heavily on executing code, usually JavaScript, on
-the user’s machine. They also make use of new and emerging Web technologies
-like HTML5, WebSockets, service workers and more. Even worse from a
-preservation point of view, they also require some form of user interaction to
-dynamically load more content (infinite scrolling, dynamic comment loading,
-etc.).
-
-The naive approach of fetching an HTML page, parsing it and extracting
-links to referenced resources is therefore not sufficient to create a faithful
-snapshot of these web applications. A full browser, capable of running scripts
-and providing modern Web APIs, is absolutely required for this task. Thankfully
-Google Chrome runs without a display (headless mode) and can be controlled by
-external programs, allowing them to navigate and extract or inject data.
-This section describes the solutions crocoite offers and explains design
-decisions taken.
-
-crocoite captures resources by listening to Chrome’s `network events`_ and
-requesting the response body using `Network.getResponseBody`_. This approach
-has caveats: The original HTTP requests and responses, as sent over the wire,
-are not available. They are reconstructed from parsed data. The character
-encoding for text documents is changed to UTF-8. And the content body of HTTP
-redirects cannot be retrieved due to a race condition.
-
-.. _network events: https://chromedevtools.github.io/devtools-protocol/1-3/Network
-.. _Network.getResponseBody: https://chromedevtools.github.io/devtools-protocol/1-3/Network#method-getResponseBody
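-
-The sketch below illustrates this capture loop over the DevTools Protocol,
-using asyncio and the websockets_ library. It is a minimal, standalone example
-and not crocoite’s actual implementation; the debugger URL, the termination
-condition and the output handling are assumptions made for brevity.
-
-.. code:: python
-
- import asyncio, base64, itertools, json
- import websockets
-
- # Assumption: Chrome runs with --remote-debugging-port=9222 and this is the
- # page target’s webSocketDebuggerUrl (listed at http://localhost:9222/json).
- DEBUGGER_URL = 'ws://localhost:9222/devtools/page/TARGET_ID'
-
- async def capture(url):
-     ids = itertools.count(1)
-     async with websockets.connect(DEBUGGER_URL, max_size=None) as ws:
-         async def call(method, **params):
-             await ws.send(json.dumps(
-                 {'id': next(ids), 'method': method, 'params': params}))
-
-         await call('Network.enable')
-         await call('Page.enable')
-         await call('Page.navigate', url=url)
-
-         while True:
-             msg = json.loads(await ws.recv())
-             if msg.get('method') == 'Network.responseReceived':
-                 # Ask Chrome for the (possibly base64-encoded) response body.
-                 await call('Network.getResponseBody',
-                            requestId=msg['params']['requestId'])
-             elif 'body' in msg.get('result', {}):
-                 body = msg['result']['body']
-                 if msg['result'].get('base64Encoded'):
-                     body = base64.b64decode(body)
-                 print(len(body), 'bytes captured')
-             elif msg.get('method') == 'Page.loadEventFired':
-                 # A real crawler keeps listening until the page is idle;
-                 # stopping here may lose bodies that are still in flight.
-                 break
-
- asyncio.get_event_loop().run_until_complete(capture('http://example.com/'))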
-
-But at the same time it allows crocoite to rely on Chrome’s well-tested network
-stack and HTTP parser. Thus it supports HTTP versions 1 and 2 as well as
-lower-level protocols like TLS and QUIC. Depending on Chrome also eliminates
-the need for a man-in-the-middle proxy, like warcprox_, which has to decrypt
-TLS traffic and present a fake certificate to the browser in order to store the
-transmitted content.
-
-.. _warcprox: https://github.com/internetarchive/warcprox
-
-WARC records generated by crocoite are therefore an abstract view of the
-resource they represent and not necessarily the data sent over the wire. A URL
-fetched with HTTP/2, for example, will still result in an HTTP/1.1
-request/response pair in the WARC file. This may be undesirable from
-an archivist’s point of view (“save the data exactly like we received it”). But
-this level of abstraction is inevitable when dealing with more than one
-protocol.
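-
-To make this concrete, the following standalone sketch writes a captured body
-and parsed header list as an HTTP/1.1 response record using warcio_. It only
-illustrates the reconstruction idea and is not crocoite’s actual serialization
-code; URL, headers and body are made-up placeholders.
-
-.. code:: python
-
- from io import BytesIO
-
- from warcio.statusandheaders import StatusAndHeaders
- from warcio.warcwriter import WARCWriter
-
- # Placeholder data as it would come out of the DevTools Protocol: parsed
- # headers and a decoded body, not the bytes that went over the wire.
- url = 'https://example.com/'
- headers = [('Content-Type', 'text/html; charset=utf-8')]
- body = b'<html><body>Hello</body></html>'
-
- with open('example.warc.gz', 'wb') as fd:
-     writer = WARCWriter(fd, gzip=True)
-     # The record claims HTTP/1.1 regardless of the protocol Chrome used.
-     http_headers = StatusAndHeaders('200 OK', headers, protocol='HTTP/1.1')
-     record = writer.create_warc_record(url, 'response',
-                                        payload=BytesIO(body),
-                                        http_headers=http_headers)
-     writer.write_record(record)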
-
-crocoite also interacts with and therefore alters the grabbed websites. It does
-so by injecting `behavior scripts`_ into the site. Typically these are written
-in JavaScript, because interacting with a page is easier this way. These
-scripts then perform different tasks: extracting targets from visible
-hyperlinks, clicking buttons or scrolling the website to load more content,
-as well as taking a static screenshot of ``<canvas>`` elements for the DOM
-snapshot (see below).
-
-.. _behavior scripts: https://github.com/PromyLOPh/crocoite/tree/master/crocoite/data
-
-Replaying archived WARCs can be quite challenging and might not be possible
-with current technology (or even at all):
-
-- Some sites request assets based on screen resolution, pixel ratio and
- supported image formats (WebP). Replaying with different parameters won’t
- work, since the matching assets are missing. Example: missguided.com.
-- Some fetch different scripts based on the user agent. Example: youtube.com.
-- Requests containing randomly generated JavaScript callback function names
- won’t work. Example: weather.com.
-- Range requests (``Range: bytes=1-100``) are captured as-is, making playback
- difficult.
-
-crocoite offers two methods to work around these issues. First, it can save a
-DOM snapshot to the WARC file: the entire DOM, serialized to HTML after the
-site has fully loaded and with ``<script>`` tags removed, so it can be
-displayed without executing scripts. Obviously JavaScript-based navigation
-does not work any more. Second, it also saves a screenshot of the full page,
-so even if future browsers cannot render and display the stored HTML, a fully
-rendered version of the website can be replayed instead.
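-
-As an illustration of the snapshot idea only (not crocoite’s actual code), the
-following sketch strips ``<script>`` elements from a serialized DOM using
-html5lib_ and ElementTree; the input document is a made-up placeholder.
-
-.. code:: python
-
- import xml.etree.ElementTree as ET
-
- import html5lib
-
- # Placeholder for the DOM as serialized by the browser after the page loaded.
- dom = '<html><head><script>alert(1)</script></head><body><p>Hi</p></body></html>'
-
- # Parse with html5lib; namespaceHTMLElements=False keeps plain tag names.
- tree = html5lib.parse(dom, treebuilder='etree', namespaceHTMLElements=False)
-
- # Drop every <script> element so the snapshot renders without executing code.
- for parent in list(tree.iter()):
-     for script in parent.findall('script'):
-         parent.remove(script)
-
- print(ET.tostring(tree, encoding='unicode', method='html'))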
-
-Advanced usage
---------------
-
-crocoite is built with the Unix philosophy (“do one thing and do it well”) in
-mind. Thus ``crocoite-grab`` can only save a single page. If you want recursion,
-use ``crocoite-recursive``, which follows hyperlinks according to ``--policy``.
-It can either recurse up to a maximum number of levels or grab all pages with
-the same prefix as the start URL:
-
-.. code:: bash
-
- crocoite-recursive --policy prefix http://www.example.com/dir/ output
-
-will save all pages in ``/dir/`` and below to individual files in the output
-directory ``output``. You can customize the command used to grab individual
-pages by appending it after ``output``. This way, distributed grabs (ssh to a
-different machine and execute the job there, queue the command with Slurm, …)
-are possible.
-
-IRC bot
-^^^^^^^
-
-A simple IRC bot (“chromebot”) is provided with the command ``crocoite-irc``.
-It reads its configuration from a config file like the example provided in
-``contrib/chromebot.json`` and supports the following commands:
-
-a <url> -j <concurrency> -r <policy>
- Archive <url> with <concurrency> processes according to recursion <policy>
-s <uuid>
- Get job status for <uuid>
-r <uuid>
- Revoke or abort running job with <uuid>
-
-Browser configuration
-^^^^^^^^^^^^^^^^^^^^^
-
-Generally crocoite provides reasonable defaults for Google Chrome via its
-`devtools module`_. When debugging this software it might be necessary to open
-a non-headless instance of the browser by running
-
-.. code:: bash
-
- google-chrome-stable --remote-debugging-port=9222 --auto-open-devtools-for-tabs
-
-and then passing the option ``--browser=http://localhost:9222`` to
-``crocoite-grab``. This allows human intervention through the browser’s built-in
-console.
-
-Another issue that might arise is related to fonts. Headless servers usually
-don’t have them installed by default and thus rendered screenshots may contain
-replacement characters (□) instead of the actual text. This mostly affects
-non-Latin character sets. It is therefore recommended to install at least
-Microsoft’s Corefonts_ as well as DejaVu_, Liberation_ or a similar font family
-covering a wide range of character sets.
-
-.. _devtools module: crocoite/devtools.py
-.. _Corefonts: http://corefonts.sourceforge.net/
-.. _DejaVu: https://dejavu-fonts.github.io/
-.. _Liberation: https://pagure.io/liberation-fonts
-
-Related projects
-----------------
-
-brozzler_
- Uses Google Chrome as well, but intercepts traffic using a proxy. Supports
- distributed crawling and immediate playback.
-Squidwarc_
- Communicates with headless Google Chrome and uses the Network API to
- retrieve requests like crocoite. Supports recursive crawls and page
- scrolling, but neither custom JavaScript nor distributed crawling.
-
-.. _brozzler: https://github.com/internetarchive/brozzler
-.. _Squidwarc: https://github.com/N0taN3rd/Squidwarc
-
+doc/index.rst
\ No newline at end of file