Age | Commit message (Collapse) | Author | Files | Lines |
|
Specifically for a) redirects (body missing), b) bodies larger than the
size limit and c) responses whose body could not be fetched for any other
reason.
We gave it our best shot, but still failed miserably. Future generations
will certainly appreciate that. Eh, maybe. Hopefully. Will they?
|
|
Judging from the docs this is the proper way to store these resources.
Enable both for the IRC bot by default, since they won’t interfere with
IA’s wayback machine.
|
|
|
|
This is mainly a quality of life change
|
|
Previously a browser crash stalled the entire grab, since events from
pychrome were handled asynchronously in a different thread and
exceptions were not propagated to the main thread.
Now all browser events are stored in a queue and processed by the main
thread, allowing us to handle browser crashes gracefully (more or less).
This made the following additional changes necessary:
- Clear separation between producer (browser) and consumer (WARC, stats,
…)
- Behavior scripts now yield events as well, instead of accessing the
WARC writer
- WARC logging was removed (for now) and the WARC writer no longer
requires serialization
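
The queue-based design described above can be sketched roughly as follows.
This is a minimal illustration and not the project's actual code: the names
BrowserCrashed, browser_producer, consume and grab are hypothetical, and the
"browser" is simulated by a plain list in which the item 'crash' stands in
for a crash.

```python
import queue
import threading

class BrowserCrashed(Exception):
    """Hypothetical exception standing in for a real browser crash."""

def browser_producer(events, items):
    # Runs in a background thread. It only puts things on the queue and
    # never calls into the consumers (WARC, stats, ...) directly.
    try:
        for item in items:
            if item == 'crash':
                raise BrowserCrashed('simulated crash')
            events.put(('event', item))
    except Exception as e:
        # Forward the exception instead of letting it die in this thread.
        events.put(('error', e))
    else:
        events.put(('done', None))

def consume(events):
    # Runs in the main thread, so a forwarded error is raised where the
    # caller can actually handle it.
    handled = []
    while True:
        kind, payload = events.get()
        if kind == 'event':
            handled.append(payload)
        elif kind == 'error':
            raise payload
        else:
            return handled

def grab(items):
    events = queue.Queue()
    t = threading.Thread(target=browser_producer, args=(events, items),
                         daemon=True)
    t.start()
    try:
        return consume(events)
    finally:
        t.join()
```

Because the producer turns its own exceptions into queue items, the main
thread is never left waiting for events that will not arrive.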
|
|
In preparation for recursive crawls.
|
|
|
|
If there is any and it was not included in the response already.
|
|
Broken by commit a21d7332e33a3e47a363004196451721d449e70b
|
|
|
|
|
|
No functional changes, just cleanup. Replaces the onload and onsnapshot
events. Also moves screen metric emulation, DOM snapshots and screenshots
here.
|
|
|
|
+refactoring.
|
|
This is an undocumented DevTools feature.
|
|
Logger and SiteWriter both access .write_record() concurrently, which
can corrupt WARC files. Move the writer to its own thread and decouple
it with a queue. Since we’re probably I/O-bound, this may speed up
writeback as well.
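
A minimal sketch of this decoupling, assuming a writer that only needs a
file-like object: QueuedWriter and demo are hypothetical names, and plain
byte strings stand in for WARC records. Callers enqueue records from any
thread, while a single writer thread performs the actual writes, so records
cannot interleave.

```python
import io
import queue
import threading

class QueuedWriter:
    """Hypothetical writer: only its own thread touches the file object."""

    def __init__(self, fd):
        self._fd = fd
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def write_record(self, data):
        # Safe to call concurrently; the record is only enqueued here.
        self._queue.put(data)

    def _run(self):
        while True:
            data = self._queue.get()
            if data is None:  # sentinel: stop the writer thread
                break
            self._fd.write(data)

    def close(self):
        # Records queued before the sentinel are written first (FIFO).
        self._queue.put(None)
        self._thread.join()

def demo():
    # Two threads (think Logger and SiteWriter) write concurrently.
    buf = io.BytesIO()
    writer = QueuedWriter(buf)

    def produce(tag):
        for _ in range(50):
            writer.write_record(tag * 64 + b'\n')

    threads = [threading.Thread(target=produce, args=(t,))
               for t in (b'L', b'S')]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    writer.close()
    return buf.getvalue().splitlines()
```

Callers return as soon as the record is queued, which is where the possible
speedup for an I/O-bound writer would come from.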
|
|
We can’t do that safely due to a race condition.
|
|
|
|
Reusable browser communication and WARC writing.
|