crocoite
========

Archive websites using `headless Google Chrome`_ and its DevTools protocol.

.. image:: https://travis-ci.org/PromyLOPh/crocoite.svg?branch=master
    :target: https://travis-ci.org/PromyLOPh/crocoite

.. _headless Google Chrome: https://developers.google.com/web/updates/2017/04/headless-chrome

Dependencies
------------

- Python 3
- pychrome_
- warcio_
- html5lib_
- Celery_

.. _pychrome: https://github.com/fate0/pychrome
.. _warcio: https://github.com/webrecorder/warcio
.. _html5lib: https://github.com/html5lib/html5lib-python
.. _Celery: http://www.celeryproject.org/

Usage
-----

One-shot command-line interface and pywb_ playback::

    crocoite-grab --output example.com.warc.gz http://example.com/
    rm -rf collections && wb-manager init test && wb-manager add test example.com.warc.gz
    wayback &
    $BROWSER http://localhost:8080

.. _pywb: https://github.com/ikreymer/pywb

Behavior scripts
^^^^^^^^^^^^^^^^

Many sites need some form of interaction to load more content dynamically.
Twitter, for instance, continuously loads new posts when the user scrolls to
the bottom of the page. crocoite can emulate these user interactions (and more)
by combining control code written in Python with JavaScript injected into the
page. The code can be limited to certain URLs or applied to every page loaded.
By default all available scripts are enabled; see the command line flag
``--behavior``.
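
The selection step described above could be sketched roughly as follows. This
is not crocoite's actual implementation: the ``Behavior`` class, its fields,
the ``select_behaviors`` helper and both example scripts (including the CSS
selector) are hypothetical and chosen only for illustration. The sketch merely
decides which scripts apply to a URL; injecting them would happen separately
through the DevTools protocol.

.. code:: python

    import re
    from dataclasses import dataclass
    from typing import List, Optional


    @dataclass
    class Behavior:
        """Hypothetical behavior: a JavaScript snippet plus an optional URL filter."""
        name: str
        script: str                      # JavaScript source to inject into the page
        url_regex: Optional[str] = None  # None means "apply to every page"

        def applies_to(self, url: str) -> bool:
            return self.url_regex is None or re.search(self.url_regex, url) is not None


    # Applies everywhere: scroll down by one viewport height.
    SCROLL = Behavior('scroll', 'window.scrollBy(0, window.innerHeight);')

    # Limited to one site; the selector is made up for this example.
    CLICK_MORE = Behavior(
        'click-more',
        'document.querySelectorAll(".show-more").forEach(e => e.click());',
        url_regex=r'^https?://(www\.)?twitter\.com/',
    )


    def select_behaviors(url: str, enabled: List[Behavior]) -> List[Behavior]:
        """Return the enabled behaviors whose URL filter matches, in order."""
        return [b for b in enabled if b.applies_to(url)]

With both behaviors enabled,
``select_behaviors('https://twitter.com/some_user', [SCROLL, CLICK_MORE])``
returns both of them, while ``http://example.com/`` only gets ``SCROLL``.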

Caveats
-------

- Original HTTP requests/responses are not available. They are rebuilt from
  parsed data. Character encoding for text documents is changed to UTF-8.
- Some sites request assets based on screen resolution, pixel ratio and
  supported image formats (webp). Replaying those with different parameters
  won’t work, since assets for those are missing. Example: missguided.com.
- Some sites fetch different scripts based on the user agent. Example:
  youtube.com.
- Requests containing randomly generated JavaScript callback function names
  won’t work. Example: weather.com.
- Range requests (``Range: bytes=1-100``) are captured as-is, making playback
  difficult.
- The content body of HTTP redirects cannot be retrieved due to a race
  condition.

Most of these issues can be worked around by using the DOM snapshot, which is
also saved. This causes its own set of issues though:

- JavaScript-based navigation does not work.
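
The caveat about randomly generated callback names can be illustrated with a
small simulation. It assumes a replay tool that looks up responses by exact
URL; the ``jsonp_url`` and ``lookup`` helpers, the ``callback`` parameter name
and the dictionary standing in for the archive are all hypothetical.

.. code:: python

    import secrets
    from typing import Dict, Optional
    from urllib.parse import urlencode


    def jsonp_url(base: str) -> str:
        """Build a request URL with a fresh random callback name, as a page script might."""
        callback = 'cb_' + secrets.token_hex(8)
        return base + '?' + urlencode({'callback': callback})


    def lookup(archive: Dict[str, bytes], url: str) -> Optional[bytes]:
        """Exact-match replay: only URLs seen during capture can be served."""
        return archive.get(url)


    # Capture: the request URL, including its random callback name, is recorded.
    captured_url = jsonp_url('http://example.com/api')
    archive = {captured_url: b'/* recorded response body */'}

    # Replay: the page script runs again and picks a different callback name,
    # so the URL it requests was never captured.
    replayed_url = jsonp_url('http://example.com/api')

Here ``lookup(archive, captured_url)`` finds the recorded body, whereas
``lookup(archive, replayed_url)`` returns ``None`` because the two URLs differ
in their callback name.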

Distributed crawling
--------------------

Configure using ``celeryconfig.py``:

.. code:: python

    broker_url = 'pyamqp://'
    result_backend = 'rpc://'
    warc_filename = '{domain}-{date}-{id}.warc.gz'
    temp_dir = '/tmp/'
    finished_dir = '/tmp/finished'

Start a Celery worker::

    celery -A crocoite.task worker -Q crocoite.archive,crocoite.controller --loglevel=info

Then queue an archive job::

    crocoite-grab --distributed http://example.com

The worker will create a temporary file named according to ``warc_filename`` in
``/tmp`` while archiving and move it to ``/tmp/finished`` when done.
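
The naming and move step could look roughly like the sketch below. It is an
illustration under assumptions, not the worker's real code: the meaning of the
``{domain}``, ``{date}`` and ``{id}`` fields (host name, UTC timestamp, random
UUID), the timestamp format, and the ``warc_name`` and ``archive_job`` helpers
are all guesses. ``write_warc`` is a placeholder callable standing in for the
actual archiving.

.. code:: python

    import os
    import shutil
    import uuid
    from datetime import datetime, timezone
    from urllib.parse import urlsplit

    WARC_FILENAME = '{domain}-{date}-{id}.warc.gz'


    def warc_name(url, template=WARC_FILENAME):
        """Expand the template; the field meanings are assumptions."""
        return template.format(
            domain=urlsplit(url).hostname,
            date=datetime.now(timezone.utc).strftime('%Y%m%d%H%M%S'),
            id=uuid.uuid4(),
        )


    def archive_job(url, temp_dir, finished_dir, write_warc):
        """Write into temp_dir first, then move the finished file."""
        name = warc_name(url)
        temp_path = os.path.join(temp_dir, name)
        write_warc(temp_path)  # placeholder for the actual archiving
        os.makedirs(finished_dir, exist_ok=True)
        final_path = os.path.join(finished_dir, name)
        shutil.move(temp_path, final_path)
        return final_path

Writing to a temporary location first means that anything watching the
finished directory only ever sees complete files.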

IRC bot
^^^^^^^

Configure sopel_ (``~/.sopel/default.cfg``) to use the plugin located in
``contrib/celerycrocoite.py``:

.. code:: ini

    [core]
    nick = chromebot
    host = irc.efnet.fr
    port = 6667
    owner = someone
    extra = /path/to/crocoite/contrib
    enable = celerycrocoite
    channels = #somechannel

Then start it by running ``sopel``. The bot must be addressed directly (i.e.
``chromebot: <command>``). The following commands are currently supported:

a <url>
    Archives <url> and all of its resources (images, CSS, …). A unique ID
    (UUID) is assigned to each job.
s <uuid>
    Gets the status of the job with <uuid>.
r <uuid>
    Revokes the job with <uuid>. If it has already started, the job will be
    killed.

.. _sopel: https://sopel.chat/

Related projects
----------------

brozzler_
    Uses Google Chrome as well, but intercepts traffic using a proxy. Supports
    distributed crawling and immediate playback.
Squidwarc_
    Communicates with headless Google Chrome and uses the Network API to
    retrieve requests like crocoite. Supports recursive crawls and page
    scrolling, but neither custom JavaScript nor distributed crawling.

.. _brozzler: https://github.com/internetarchive/brozzler
.. _Squidwarc: https://github.com/N0taN3rd/Squidwarc