crocoite
========
Archive websites using Google Chrome and its DevTools protocol.
Tested with Google Chrome 62.0.3202.89 for Linux only.
Dependencies
------------
- Python 3
- pychrome_
- warcio_
- html5lib
.. _pychrome: https://github.com/fate0/pychrome
.. _warcio: https://github.com/webrecorder/warcio
Usage
-----
One-shot commandline interface and pywb_ playback::

    google-chrome-stable --window-size=1920,1080 --remote-debugging-port=9222 &
    crocoite-standalone http://example.com/ example.com.warc.gz
    rm -rf collections && wb-manager init test && wb-manager add test example.com.warc.gz
    wayback &
    $BROWSER http://localhost:8080

For `headless Google Chrome`_ add the parameters ``--headless --disable-gpu``.
.. _pywb: https://github.com/ikreymer/pywb
.. _headless Google Chrome: https://developers.google.com/web/updates/2017/04/headless-chrome
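
Before running ``crocoite-standalone`` it can help to confirm that Chrome's
remote debugging endpoint is reachable. The following is a minimal sketch, not
part of crocoite: it assumes the port 9222 used above and Chrome's
``/json/version`` HTTP endpoint, and the helper names ``devtools_version_url``
and ``browser_version`` are hypothetical.

.. code-block:: python

    import json
    from urllib.error import URLError
    from urllib.request import urlopen


    def devtools_version_url(host="localhost", port=9222):
        """Build the URL of Chrome's DevTools version endpoint."""
        return "http://{}:{}/json/version".format(host, port)


    def browser_version(host="localhost", port=9222, timeout=2.0):
        """Return the reported browser version, or None if unreachable."""
        try:
            with urlopen(devtools_version_url(host, port), timeout=timeout) as resp:
                return json.load(resp).get("Browser")
        except (URLError, OSError, ValueError):
            # Connection refused, timeout, or a non-JSON reply.
            return None


    if __name__ == "__main__":
        print(browser_version() or "Chrome DevTools endpoint not reachable")

If the function returns ``None``, Chrome is probably not running with
``--remote-debugging-port=9222``.
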
Caveats
-------
- Original HTTP requests/responses are not available. They are rebuilt from
  parsed data. Character encoding for text documents is changed to UTF-8.
- Some sites request assets based on screen resolution, pixel ratio and
  supported image formats (webp). Replaying such sites with different
  parameters won’t work, since the matching assets were never captured.
  Example: missguided.com.
- Some sites fetch different scripts based on user agent. Example: youtube.com.
- Requests containing randomly generated JavaScript callback function names
  won’t work. Example: weather.com.
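
The caveat about randomly generated callback names can be illustrated with a
small sketch. It assumes, purely for illustration, that recorded responses are
looked up by exact URL; real replay tools may apply their own matching rules.
The URLs, the ``callback`` parameter name and the helper functions are all
hypothetical.

.. code-block:: python

    import random
    import string


    def random_callback_name(rng):
        """Mimic a page script that invents a fresh callback name per load."""
        alphabet = string.ascii_lowercase + string.digits
        return "cb_" + "".join(rng.choice(alphabet) for _ in range(8))


    def jsonp_url(callback):
        """Build a hypothetical JSONP request URL for the given callback."""
        return "http://api.example.com/data?callback=" + callback


    def find_recorded(archive, url):
        """Exact-URL lookup; returns None when the URL was never recorded."""
        return archive.get(url)


    if __name__ == "__main__":
        rng = random.Random()
        # At archive time the page requested one random callback name ...
        archive = {jsonp_url(random_callback_name(rng)): "recorded response"}
        # ... and at replay time the same script generates a different one.
        replay_url = jsonp_url(random_callback_name(rng))
        print(find_recorded(archive, replay_url))

Because the name generated during replay almost never equals the recorded
one, the lookup misses and the request cannot be served from the archive.
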

Most of these issues can be worked around by using the DOM snapshot, which is
also saved. The snapshot has its own set of issues, though:

- JavaScript-based navigation does not work.
- Scripts that modify styles based on scroll position currently remain stuck
  in their end-of-page state. Example: twitter.com.
- CSS-based asset loading (screen size, pixel ratio, …) still does not work.
- Canvas contents are probably not preserved.