swayback
========

This is a proof of concept for Service Worker-based web app replay, similar to
archive.org’s Wayback Machine.

Rationale
---------

Traditionally, replaying websites has relied heavily on rewriting URLs in
static HTML pages to adapt them to a new origin and path hierarchy (e.g.
``https://web.archive.org/web/<date>/<url>``). With the rise of web apps, which
load their content dynamically, this is no longer sufficient.

Let’s look at Instagram as an example: users’ profiles dynamically load
content to implement “infinite scrolling”. The corresponding request is a
GraphQL query, which returns JSON-encoded data with an application-defined
structure. This response includes URLs to images, which must be rewritten as
well for replay to work correctly. So the replay software needs to parse and
rewrite JSON as well as HTML.

However, this response could have used an arbitrary serialization format and
may contain relative URLs or just values used in a URL template, which are
more difficult to spot than absolute URLs. This makes server-side rewriting
difficult and cumbersome, perhaps even impossible.
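
For illustration, such a response might look roughly like the sketch below (a
made-up structure, not Instagram’s actual schema). Both the absolute image URL
and the bare shortcode, which the client only turns into a URL via a template,
would have to be handled for replay to work:

.. code:: python

    # Hypothetical JSON payload of an "infinite scrolling" request; the field
    # names are invented for illustration and do not reflect any real API.
    import json

    payload = json.loads("""
    {
      "data": {
        "media": [
          {
            "shortcode": "aBcDeF",
            "display_url": "https://images.example.com/p/aBcDeF.jpg"
          }
        ]
      }
    }
    """)

    # The absolute URL is easy to spot and rewrite ...
    print(payload["data"]["media"][0]["display_url"])
    # ... but the bare shortcode only becomes a URL once the client inserts it
    # into a template such as "/p/{shortcode}/", which a server-side rewriter
    # has no way of knowing about.
    print(payload["data"]["media"][0]["shortcode"])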

Implementation
--------------

Instead, swayback relies on a web technology called *Service Workers*. A
service worker can be installed for a given domain and path prefix and
essentially acts as a proxy between the browser and the server, allowing it to
intercept and rewrite any request a web app makes, which is exactly what we
need to properly replay archived web apps.

swayback therefore provides an HTTP server responding to requests for the
wildcard domain ``*.swayback.localhost``. The page served first installs a
service worker and then reloads itself. Now the service worker is in control of
network requests and rewrites a request like (for instance)
``www.instagram.com.swayback.localhost:5000/bluebellwooi/`` to
``swayback.localhost:5000/raw`` with the real URL in the POST request body.
swayback’s server looks up that URL in the provided WARC files and replies
with the original server’s response, which the service worker then returns to
the browser without modification.
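
A minimal sketch of that lookup step, using warcio directly (an illustration
under assumptions, not swayback’s actual code; the WARC file name and the
example URL are made up):

.. code:: python

    # Build an in-memory index mapping archived URLs to their raw HTTP payloads.
    from warcio.archiveiterator import ArchiveIterator

    def build_index(warc_path):
        """Map each archived response's URL to its HTTP payload."""
        index = {}
        with open(warc_path, 'rb') as fd:
            for record in ArchiveIterator(fd):
                if record.rec_type == 'response':
                    url = record.rec_headers.get_header('WARC-Target-URI')
                    index[url] = record.content_stream().read()
        return index

    index = build_index('example.warc.gz')
    # The service worker POSTs the real URL to /raw; the server would answer
    # with the payload stored under that URL.
    body = index.get('https://www.instagram.com/bluebellwooi/')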

Usage
-----

Since this is a proof of concept, functionality is quite limited. You’ll need
the following Python packages:

- flask
- warcio

swayback uses the hardcoded domain ``swayback.localhost``, which means you need
to set up your DNS resolver accordingly. An example configuration for unbound
looks like this:

.. code:: unbound

    local-zone: "swayback.localhost." redirect
    local-data: "swayback.localhost. 30 IN A 127.0.0.1"

After you have recorded some WARCs, move them into swayback’s base directory and run:

.. code:: bash

    export FLASK_APP=swayback/__init__.py
    export FLASK_DEBUG=1
    flask run --with-threads

Then navigate to http://swayback.localhost:5000, which (hopefully) lists all
HTML pages found in those WARC files.

Caveats
-------

- Hardcoded replay domain
- URL lookup is broken; only HTTPS sites work correctly

Related projects
----------------

This approach complements efforts such as crocoite_, a web crawler based on
Google Chrome.

.. _crocoite: https://github.com/PromyLOPh/crocoite