-rw-r--r--  README.rst  35
1 files changed, 25 insertions, 10 deletions
diff --git a/README.rst b/README.rst
index 59629fe..fd4ee93 100644
--- a/README.rst
+++ b/README.rst
@@ -12,15 +12,15 @@ HTML pages to adapt them to a new origin and path hierarchy (i.e.
``https://web.archive.org/web/<date>/<url>``). With the rise of web apps, which
load their content dynamically, this is no longer sufficient.
-Let’s look at Instagram as an example for this: User’s profiles dynamically
-load content to implement “infinite scrolling”. The corresponding request is a
-GraphQL query, which returns JSON-encoded data with an application-defined
-structure. This response includes URL’s to images, which must be rewritten as
-well, in order for replay to work correctly. So the replay software needs to
-parse and rewrite JSON as well as HTML.
+Instagram is an example for this: User’s profiles dynamically load content to
+implement “infinite scrolling”. The corresponding request is a GraphQL query,
+which returns JSON-encoded data with an application-defined structure. This
+response includes URL’s to images, which must be rewritten as well, in order
+for replay to work correctly. So the replay software needs to parse and rewrite
+JSON as well as HTML.
However, this response could have used an arbitrary serialization format and
-may contain relative URL’s or just values used in a URL template, which are
+may contain relative URL’s or just values used in a URL template. Both are
more difficult to spot than absolute URL’s. This makes server-side rewriting
difficult and cumbersome, perhaps even impossible.
@@ -30,16 +30,16 @@ Implementation
Instead swayback relies on a new web technology called *Service Workers*. These
can be installed for a given domain and path prefix. They basically act as a
proxy between the browser and server, allowing them to intercept and rewrite
-any request a web app makes. Which is exactly what we need to properly replay
+any request a web app makes. This is exactly what is needed to properly replay
archived web apps.
-So swayback provides an HTTP server, responding to queries for the wildcard
+swayback provides an HTTP server, responding to queries for the wildcard
domain ``*.swayback.localhost``. The page served first installs a service
worker and then reloads the page. Now the service worker is in control of
network requests and rewrites a request like (for instance)
``www.instagram.com.swayback.localhost:5000/bluebellwooi/`` to
``swayback.localhost:5000/raw`` with the real URL in the POST request body.
-swayback’s server looks up that URL in the WARC files provided and and replies
+swayback’s server looks up that URL in the WARC files provided and replies
with the original server’s response, which is then returned by the service
worker to the browser without modification.
@@ -84,5 +84,20 @@ Related projects
This approach complements efforts such as crocoite_, a web crawler based on
Google Chrome.
+Reconstructive_/ipwb_
+ Uses Service Worker to intercept and rewrite requests. Relies on Referer
+ header. Rewrites links inside HTML pages using Regular Expressions before
+ passing them to the browser. See `Client-side Reconstruction of Composite
+ Mementos Using ServiceWorker`__.
+
+ __ http://www.cs.odu.edu/%7Emkelly/papers/2017_jcdl_serviceWorker.pdf
+pywb_
+ Uses `rewrite modules`_ to alter URLs in HTML pages/JSON
+ responses/cookies/…
+
+.. _rewrite modules: https://github.com/webrecorder/pywb/tree/master/pywb/rewrite
+.. _pywb: https://github.com/webrecorder/pywb/
.. _crocoite: https://github.com/PromyLOPh/crocoite
+.. _Reconstructive: https://github.com/oduwsdl/Reconstructive/
+.. _ipwb: https://github.com/oduwsdl/ipwb/