diff options
Diffstat (limited to 'README.rst')
-rw-r--r-- | README.rst | 123 |
1 files changed, 6 insertions, 117 deletions
@@ -1,126 +1,15 @@ crocoite ======== -Archive websites using `headless Google Chrome_` and its DevTools protocol. +.. code:: bash -.. _headless Google Chrome: https://developers.google.com/web/updates/2017/04/headless-chrome - -Dependencies ------------- - -- Python 3 -- pychrome_ -- warcio_ -- html5lib_ -- Celery_ - -.. _pychrome: https://github.com/fate0/pychrome -.. _warcio: https://github.com/webrecorder/warcio -.. _html5lib: https://github.com/html5lib/html5lib-python -.. _Celery: http://www.celeryproject.org/ - -Usage ------ - -One-shot commandline interface and pywb_ playback:: - - crocoite-standalone http://example.com/ example.com.warc.gz - rm -rf collections && wb-manager init test && wb-manager add test example.com.warc.gz + pip install crocoite pywb + crocoite http://example.com/ example.com.warc.gz + wb-manager init test && wb-manager add test example.com.warc.gz wayback & $BROWSER http://localhost:8080 -.. _pywb: https://github.com/ikreymer/pywb - -Behavior scripts -^^^^^^^^^^^^^^^^ - -A lot of sites need some form of interaction to dynamically load more content. Twitter for -instance continously loads new posts when scrolling to the bottom of the page. -crocoite can emulate these user interactions (and more) by combining control -code written in Python and injecting JavaScript into the page. The code can be -limited to certain URLs or apply to every page loaded. By default all scripts -available are enabled, see command line flag ``--behavior``. - -Caveats -------- - -- Original HTTP requests/responses are not available. They are rebuilt from - parsed data. Character encoding for text documents is changed to UTF-8. -- Some sites request assets based on screen resolution, pixel ratio and - supported image formats (webp). Replaying those with different parameters - won’t work, since assets for those are missing. Example: missguided.com. -- Some fetch different scripts based on user agent. Example: youtube.com. -- Requests containing randomly generated JavaScript callback function names - won’t work. Example: weather.com. -- Range requests (Range: bytes=1-100) are captured as-is, making playback - difficult -- Content body of HTTP redirects cannot be retrived due to race condition - -Most of these issues can be worked around by using the DOM snapshot, which is -also saved. This causes its own set of issues though: - -- JavaScript-based navigation does not work. - -Distributed crawling --------------------- - -Configure using celeryconfig.py - -.. code:: python - - broker_url = 'pyamqp://' - result_backend = 'rpc://' - warc_filename = '{domain}-{date}-{id}.warc.gz' - temp_dir = '/tmp/' - finished_dir = '/tmp/finished' - -Start a Celery worker:: - - celery -A crocoite.cli worker --loglevel=info - -Then queue archive job:: - - crocoite-standalone --distributed http://example.com '' - -The worker will create a temporary file named according to ``warc_filename`` in -``/tmp`` while archiving and move it to ``/tmp/finished`` when done. - -IRC bot -^^^^^^^ - -Configure sopel_ (``~/.sopel/default.cfg``) to use the plugin located in -``contrib/celerycrocoite.py`` - -.. code:: ini - - [core] - nick = chromebot - host = irc.efnet.fr - port = 6667 - owner = someone - extra = /path/to/crocoite/contrib - enable = celerycrocoite - channels = #somechannel - -Then start it by running ``sopel``. The bot must be addressed directly (i.e. -``chromebot: <command>``). The following commands are currently supported: - -ao <url> - Archives <url> and all of its resources (images, css, …). A unique UID - (UUID) is assigned to each job. -s <uuid> - Get status of job with <uuid> -r <uuid> - Revoke job with <uuid>. If it started already the job will be killed. - -.. _sopel: https://sopel.chat/ - -Related projects ----------------- - -brozzler_ - Uses Google Chrome as well, but intercepts traffic using a proxy. Supports - distributed crawling and immediate playback. +See documentation_ for more information. -.. _brozzler: https://github.com/internetarchive/brozzler +.. _documentation: https://6xq.net/crocoite/ |