crocoite
========

Preservation for the modern web, powered by `headless Google
Chrome`_.

.. image:: https://travis-ci.org/PromyLOPh/crocoite.svg?branch=master
    :target: https://travis-ci.org/PromyLOPh/crocoite

.. image:: https://codecov.io/gh/PromyLOPh/crocoite/branch/master/graph/badge.svg
  :target: https://codecov.io/gh/PromyLOPh/crocoite

.. _headless Google Chrome: https://developers.google.com/web/updates/2017/04/headless-chrome

Quick start
-----------

These dependencies must be present to run crocoite:

- Python ≥3.6
- PyYAML_
- aiohttp_
- websockets_
- warcio_
- html5lib_
- yarl_
- multidict_
- bottom_ (IRC client)
- `Google Chrome`_

.. _PyYAML: https://pyyaml.org/wiki/PyYAML
.. _aiohttp: https://aiohttp.readthedocs.io/
.. _websockets: https://websockets.readthedocs.io/
.. _warcio: https://github.com/webrecorder/warcio
.. _html5lib: https://github.com/html5lib/html5lib-python
.. _bottom: https://github.com/numberoverzero/bottom
.. _Google Chrome: https://www.google.com/chrome/
.. _yarl: https://yarl.readthedocs.io/
.. _multidict: https://multidict.readthedocs.io/

The following commands clone the repository from GitHub_, set up a virtual
environment and install crocoite:

.. _GitHub: https://github.com/PromyLOPh/crocoite

.. code:: bash

    git clone https://github.com/PromyLOPh/crocoite.git
    cd crocoite
    virtualenv -p python3 sandbox
    source sandbox/bin/activate
    pip install .

One-shot command line interface and pywb_ playback:

.. code:: bash

    pip install pywb
    crocoite-grab http://example.com/ example.com.warc.gz
    rm -rf collections && wb-manager init test && wb-manager add test example.com.warc.gz
    wayback &
    $BROWSER http://localhost:8080

.. _pywb: https://github.com/ikreymer/pywb

Rationale
---------

Most modern websites depend heavily on executing code, usually JavaScript, on
the user’s machine. They also make use of new and emerging Web technologies
like HTML5, WebSockets, service workers and more. Even worse from the
preservation point of view, they also require some form of user interaction to
dynamically load more content (infinite scrolling, dynamic comment loading,
etc).

The naive approach of fetching an HTML page, parsing it and extracting
links to referenced resources is therefore not sufficient to create a faithful
snapshot of these web applications. A full browser, capable of running scripts and
providing modern Web APIs, is required for this task. Thankfully,
Google Chrome runs without a display (headless mode) and can be controlled by
external programs, allowing them to navigate and extract or inject data.
This section describes the solutions crocoite offers and explains the design
decisions taken.

crocoite captures resources by listening to Chrome’s `network events`_ and
requesting the response body using `Network.getResponseBody`_. This approach
has caveats: The original HTTP requests and responses, as sent over the wire,
are not available. They are reconstructed from parsed data. The character
encoding for text documents is changed to UTF-8. And the content body of HTTP
redirects cannot be retrieved due to a race condition.

.. _network events: https://chromedevtools.github.io/devtools-protocol/1-3/Network
.. _Network.getResponseBody: https://chromedevtools.github.io/devtools-protocol/1-3/Network#method-getResponseBody

But at the same time it allows crocoite to rely on Chrome’s well-tested network
stack and HTTP parser. Thus it supports HTTP version 1 and 2 as well as
transport protocols like SSL and QUIC. Depending on Chrome also eliminates the
need for a man-in-the-middle proxy, like warcprox_, which has to decrypt SSL
traffic and present a fake certificate to the browser in order to store the
transmitted content.

.. _warcprox: https://github.com/internetarchive/warcprox
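
As a rough illustration of the capture flow described above, the following
sketch tracks DevTools protocol messages and asks for a response body once
``Network.loadingFinished`` arrives. It is not crocoite’s actual code: the
class ``ResponseBodyCollector`` and its methods are hypothetical, and the
WebSocket transport is left out so that only the message handling is shown.

.. code:: python

    import base64
    import itertools
    import json


    class ResponseBodyCollector:
        """Hypothetical helper: request bodies of finished responses."""

        def __init__(self):
            self._ids = itertools.count(1)
            self._pending = {}  # command id -> requestId
            self.bodies = {}    # requestId -> bytes

        def command(self, method, params=None):
            """Serialize a DevTools command, return (id, JSON text)."""
            cmd_id = next(self._ids)
            msg = {"id": cmd_id, "method": method, "params": params or {}}
            return cmd_id, json.dumps(msg)

        def handle(self, raw):
            """Process one incoming message, return commands to send next."""
            msg = json.loads(raw)
            if msg.get("method") == "Network.loadingFinished":
                request_id = msg["params"]["requestId"]
                cmd_id, text = self.command(
                    "Network.getResponseBody", {"requestId": request_id})
                self._pending[cmd_id] = request_id
                return [text]
            if msg.get("id") in self._pending:
                request_id = self._pending.pop(msg["id"])
                result = msg.get("result")
                if result is None:
                    # Error reply: the body is not available.
                    return []
                body = result["body"]
                if result["base64Encoded"]:
                    self.bodies[request_id] = base64.b64decode(body)
                else:
                    # Text arrives as a decoded string, store it as UTF-8.
                    self.bodies[request_id] = body.encode("utf-8")
            return []

In a real client, ``Network.enable`` would be sent first and every string
returned by ``handle`` would be written to the browser’s DevTools WebSocket
connection.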

WARC records generated by crocoite are therefore an abstract view of the
resource they represent and not necessarily the data sent over the wire. A URL
fetched with HTTP/2, for example, will still result in an HTTP/1.1
request/response pair in the WARC file. This may be undesirable from
an archivist’s point of view (“save the data exactly like we received it”). But
this level of abstraction is inevitable when dealing with more than one
protocol.
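
To make this abstraction concrete, here is a minimal sketch of how parsed
response data could be serialized as an HTTP/1.1 message. The function
``rebuild_http11_response`` is hypothetical and does not mirror crocoite’s
implementation. It assumes headers arrive as ``(name, value)`` pairs and that
HTTP/2 pseudo-header fields (names starting with a colon) have to be dropped.

.. code:: python

    from http import HTTPStatus


    def rebuild_http11_response(status, headers, body):
        """Serialize parsed response data as an HTTP/1.1 message."""
        try:
            # HTTP/2 carries no reason phrase, so look up the standard one.
            reason = HTTPStatus(status).phrase
        except ValueError:
            reason = ""
        lines = [f"HTTP/1.1 {status} {reason}"]
        for name, value in headers:
            if name.startswith(":"):
                continue  # pseudo-header fields do not exist in HTTP/1.1
            lines.append(f"{name}: {value}")
        head = "\r\n".join(lines) + "\r\n\r\n"
        return head.encode("iso-8859-1") + body

The result looks like a regular HTTP/1.1 response, although these bytes were
never sent over the wire in this form.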

crocoite also interacts with and therefore alters the grabbed websites. It does
so by injecting `behavior scripts`_ into the site. Typically these are written
in JavaScript, because interacting with a page is easier this way. These
scripts then perform different tasks: extracting targets from visible
hyperlinks, clicking buttons or scrolling the website to load more content,
as well as taking a static screenshot of ``<canvas>`` elements for the DOM
snapshot (see below).

.. _behavior scripts: https://github.com/PromyLOPh/crocoite/tree/master/crocoite/data

Replaying archived WARCs can be quite challenging and might not be possible
with current technology (or even at all):

- Some sites request assets based on screen resolution, pixel ratio and
  supported image formats (webp). Replaying those with different parameters
  won’t work, since assets for those are missing. Example: missguided.com.
- Some fetch different scripts based on user agent. Example: youtube.com.
- Requests containing randomly generated JavaScript callback function names
  won’t work. Example: weather.com.
- Range requests (``Range: bytes=1-100``) are captured as-is, making playback
  difficult.

crocoite offers two methods to work around these issues. Firstly, it can save a
DOM snapshot to the WARC file. It contains the entire DOM in HTML format minus
``<script>`` tags after the site has been fully loaded and thus can be
displayed without executing scripts. Obviously, JavaScript-based navigation
does not work any more. Secondly, it also saves a screenshot of the full page,
so even if future browsers cannot render and display the stored HTML, a fully
rendered version of the website can be replayed instead.
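
The idea of a script-free snapshot can be sketched with Python’s standard
library. This is only an illustration under stated assumptions: it uses
``html.parser`` to stay self-contained, the class ``ScriptStripper`` and the
function ``strip_scripts`` are hypothetical names, and it operates on an HTML
string rather than on the live DOM of a browser.

.. code:: python

    from html.parser import HTMLParser


    class ScriptStripper(HTMLParser):
        """Re-emit an HTML document while dropping <script> elements."""

        def __init__(self):
            # Keep character references untouched so text is copied as-is.
            super().__init__(convert_charrefs=False)
            self.out = []
            self.in_script = False

        def handle_starttag(self, tag, attrs):
            if tag == "script":
                self.in_script = True
            else:
                self.out.append(self.get_starttag_text())

        def handle_startendtag(self, tag, attrs):
            if tag != "script":
                self.out.append(self.get_starttag_text())

        def handle_endtag(self, tag):
            if tag == "script":
                self.in_script = False
            else:
                self.out.append(f"</{tag}>")

        def handle_data(self, data):
            if not self.in_script:
                self.out.append(data)

        def handle_entityref(self, name):
            self.out.append(f"&{name};")

        def handle_charref(self, name):
            self.out.append(f"&#{name};")

        def handle_comment(self, data):
            self.out.append(f"<!--{data}-->")

        def handle_decl(self, decl):
            self.out.append(f"<!{decl}>")


    def strip_scripts(html):
        """Return the document with all <script> elements removed."""
        parser = ScriptStripper()
        parser.feed(html)
        parser.close()
        return "".join(parser.out)

Calling ``strip_scripts`` on a page keeps its markup and text but removes
inline and external scripts, so the result can be displayed without executing
any code.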

Advanced usage
--------------

crocoite is built with the Unix philosophy (“do one thing and do it well”) in
mind. Thus ``crocoite-grab`` can only save a single page. If you want recursion,
use ``crocoite-recursive``, which follows hyperlinks according to ``--policy``.
It can either recurse a maximum number of levels or grab all pages with the
same prefix as the start URL:

.. code:: bash

   crocoite-recursive --policy prefix http://www.example.com/dir/ output

will save all pages in ``/dir/`` and below to individual files in the output
directory ``output``. You can customize the command used to grab individual
pages by appending it after ``output``. This way distributed grabs (ssh to a
different machine and execute the job there, queue the command with Slurm, …)
are possible.
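
The two kinds of recursion policy mentioned above can be pictured as
predicates that decide whether a discovered link is followed. The sketch below
is not how ``crocoite-recursive`` is implemented. ``prefix_policy``,
``depth_policy``, ``crawl`` and the ``get_links`` callback (which stands in
for grabbing a page and extracting its hyperlinks) are all hypothetical.

.. code:: python

    from collections import deque


    def prefix_policy(start_url):
        """Follow only URLs sharing the start URL as a prefix."""
        return lambda url, depth: url.startswith(start_url)


    def depth_policy(max_depth):
        """Follow links up to max_depth levels below the start URL."""
        return lambda url, depth: depth <= max_depth


    def crawl(start_url, get_links, policy):
        """Breadth-first traversal, returns URLs in visiting order."""
        seen = {start_url}
        queue = deque([(start_url, 0)])
        order = []
        while queue:
            url, depth = queue.popleft()
            order.append(url)
            for link in get_links(url):
                if link not in seen and policy(link, depth + 1):
                    seen.add(link)
                    queue.append((link, depth + 1))
        return order

With ``prefix_policy("http://www.example.com/dir/")`` a link to
``http://www.example.com/other`` is skipped, while ``depth_policy(1)`` would
follow it but go no deeper than one level.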

IRC bot
^^^^^^^

A simple IRC bot (“chromebot”) is provided with the command ``crocoite-irc``.
It reads its configuration from a config file like the example provided in
``contrib/chromebot.ini`` and supports the following commands:

a <url> -j <concurrency> -r <policy>
    Archive <url> with <concurrency> processes according to recursion <policy>
s <uuid>
    Get job status for <uuid>
r <uuid>
    Revoke or abort running job with <uuid>

Browser configuration
^^^^^^^^^^^^^^^^^^^^^

Generally crocoite provides reasonable defaults for Google Chrome via its
`devtools module`_. When debugging this software it might be necessary to open
a non-headless instance of the browser by running

.. code:: bash

   google-chrome-stable --remote-debugging-port=9222 --auto-open-devtools-for-tabs

and then passing the option ``--browser=http://localhost:9222`` to
``crocoite-grab``. This allows human intervention through the browser’s built-in
console.

Another issue that might arise is related to fonts. Headless servers usually
don’t have them installed by default and thus rendered screenshots may contain
replacement characters (□) instead of the actual text. This affects mostly
non-Latin character sets. It is therefore recommended to install at least
Microsoft’s Corefonts_ as well as DejaVu_, Liberation_ or a similar font family
covering a wide range of character sets.

.. _devtools module: crocoite/devtools.py
.. _Corefonts: http://corefonts.sourceforge.net/
.. _DejaVu: https://dejavu-fonts.github.io/
.. _Liberation: https://pagure.io/liberation-fonts

Related projects
----------------

brozzler_
    Uses Google Chrome as well, but intercepts traffic using a proxy. Supports
    distributed crawling and immediate playback.
Squidwarc_
    Communicates with headless Google Chrome and uses the Network API to
    retrieve requests like crocoite. Supports recursive crawls and page
    scrolling, but neither custom JavaScript nor distributed crawling.

.. _brozzler: https://github.com/internetarchive/brozzler
.. _Squidwarc: https://github.com/N0taN3rd/Squidwarc