Age | Commit message | Author | Files | Lines |
|
|
|
We may not be able to reproduce every failure, so logging as much as
possible is important to figure out what went wrong. Also, in case a bug
is uncovered in the future, we can check the logs and possibly fix it
with -errata.
|
|
RecursiveController used a custom .cancel() method before. Instead we
can simply cancel .run() and handle the CancelledError inside run() and
fetch().
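
A minimal sketch of that cancellation pattern, assuming a simplified controller. The `Controller` class, its `cleaned_up` flag and the `main()` driver are hypothetical stand-ins, not the actual RecursiveController code:

```python
import asyncio


class Controller:
    """Hypothetical stand-in for a controller with a long-running run()."""

    def __init__(self):
        self.cleaned_up = False

    async def fetch(self):
        try:
            await asyncio.sleep(3600)  # placeholder for a long-running fetch
        except asyncio.CancelledError:
            # Release per-fetch resources here, then let run() see the cancellation.
            raise

    async def run(self):
        try:
            await self.fetch()
        except asyncio.CancelledError:
            # Cleanup that a custom .cancel() method would otherwise do.
            self.cleaned_up = True
            raise


async def main():
    controller = Controller()
    task = asyncio.ensure_future(controller.run())
    await asyncio.sleep(0)  # give run() a chance to start
    task.cancel()           # no custom cancel method needed
    try:
        await task
    except asyncio.CancelledError:
        pass
    return controller.cleaned_up
```

Re-raising CancelledError after cleanup keeps the task in the cancelled state, so callers awaiting it can still tell that it did not finish normally.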
|
|
Use the yarl library (already pulled in by aiohttp). URLs should never be
handled as plain strings.
|
|
Has been replaced by handler a while ago.
|
|
|
|
Allow cancellation of timeout wait.
|
|
Fix a few random issues pointed out by pylint, mainly unused imports.
|
|
Move it to .devtools. Seems more fitting.
|
|
This is a direct port to asyncio without any design changes. These need
to happen in further refinements.
Fixes issue #1.
|
|
These are more sane than the previous super-short defaults. Obviously
this will slow down recursive crawls.
|
|
Use exit status to signal something is wrong. Check it within recursive,
increment crashed counter and do not move the resulting WARC, since it
might be broken.
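
The exit status check could look roughly like this. It is a sketch under assumptions: `run_job`, `handle`, the `stats` dict and the `moved` list are hypothetical names, and appending to `moved` stands in for moving the WARC to its final location:

```python
import subprocess
import sys

stats = {'finished': 0, 'crashed': 0}
moved = []  # stands in for WARCs moved to their final location


def run_job(args):
    """Run one child job and report whether it exited with status 0."""
    proc = subprocess.run(args)
    return proc.returncode == 0


def handle(name, args):
    if run_job(args):
        stats['finished'] += 1
        moved.append(name)
    else:
        # Non-zero exit status: count the crash and leave the WARC alone,
        # because it might be incomplete.
        stats['crashed'] += 1
```

For example, `handle('ok.warc.gz', [sys.executable, '-c', 'pass'])` counts as finished, while a child that calls `sys.exit(1)` only increments the crashed counter.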
|
|
Using websockets, vue and bulma.
|
|
Wait for remaining jobs to finish without accepting new ones, but still
allow some interaction with the bot (status/revoke).
|
|
|
|
Add logging, split bot into abstract bot implementation and actual
chromebot implementation, move some reusable checks into decorators.
|
|
For -recursive and -irc
|
|
|
|
chromebot is back! Dropping sopel, because it does not work well with
asyncio.
|
|
❤️ asyncio.
|
|
Simple and sequential.
|
|
|
|
Gonna rewrite that properly.
|
|
Commit 7730e0d64ec895091a0dd7eb0e3c6ce2ed02d981 removed logging to WARC
files. Add it again, but with a different implementation. Credits to
structlog for inspiration.
|
|
Previously a browser crash stalled the entire grab, since events from
pychrome were handled asynchronously in a different thread and
exceptions were not propagated to the main thread.
Now all browser events are stored in a queue and processed by the main
thread, allowing us to handle browser crashes gracefully (more or less).
This made the following additional changes necessary:
- Clear separation between producer (browser) and consumer (WARC, stats,
…)
- Behavior scripts now yield events as well, instead of accessing the
WARC writer
- WARC logging was removed (for now) and WARC writer does not require
serialization any more
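
The producer/consumer split described above can be sketched as follows. This illustrates the idea only: `BrowserCrashed`, `producer` and `grab` are hypothetical names, and the event tuples are made up for the example:

```python
import queue
import threading


class BrowserCrashed(Exception):
    """Hypothetical exception representing a browser crash."""


def producer(events):
    """Runs in a background thread and only ever puts items on the queue."""
    events.put(('response', 'http://example.com/'))
    events.put(('response', 'http://example.com/style.css'))
    # A failure is delivered as an ordinary event instead of being lost
    # inside the background thread.
    events.put(('crash', BrowserCrashed('tab went away')))


def grab():
    events = queue.Queue()
    thread = threading.Thread(target=producer, args=(events,), daemon=True)
    thread.start()
    seen = []
    try:
        while True:
            kind, payload = events.get(timeout=5)
            if kind == 'crash':
                raise payload  # raised in the main thread, where it can be handled
            seen.append(payload)  # consumers (WARC, stats, ...) would go here
    except BrowserCrashed as e:
        return seen, str(e)
    finally:
        thread.join()
```

Because every event passes through the queue, the consumers never need to know which thread produced it, and a crash stops the loop instead of stalling it.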
|
|
|
|
Broken by commit 75019eac4545bb2e8b90033834e91beef614cdf3
|
|
|
|
Only local right now, not distributed.
|
|
|
|
In preparation for recursive crawls.
|
|
|
|
No functional changes, just cleanup. Replaces onload and onsnapshot
events. Move screen metric emulation, DOM snapshots and screenshots here
as well.
|
|
|
|
We need a better solution for this. Sites loading a lot of responsive
images easily need a minute after resizing.
|
|
Logger and SiteWriter both access .write_record() concurrently, which
can corrupt WARC files. Move the writer to its own thread and decouple
it with a queue. Since we’re probably I/O-bound this may speed up
writeback as well.
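
A rough sketch of decoupling concurrent writers with a queue. `QueuedWriter` is a hypothetical class, not the actual implementation, and it writes raw bytes rather than real WARC records:

```python
import io
import queue
import threading


class QueuedWriter:
    """Funnels writes from several threads through a single writer thread."""

    _STOP = object()  # sentinel telling the writer thread to exit

    def __init__(self, fd):
        self.fd = fd
        self.queue = queue.Queue()
        self.thread = threading.Thread(target=self._drain)
        self.thread.start()

    def _drain(self):
        # Only this thread touches self.fd, so records cannot interleave.
        while True:
            record = self.queue.get()
            if record is self._STOP:
                break
            self.fd.write(record)

    def write_record(self, record):
        # Safe to call from any thread; returns without waiting for I/O.
        self.queue.put(record)

    def close(self):
        self.queue.put(self._STOP)
        self.thread.join()
```

Callers such as a logger and a site writer would share one `QueuedWriter` and call `write_record()` freely, then call `close()` once to flush the queue.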
|
|
|
|
Using celery. Also adds a plugin for the IRC bot sopel. Code still needs
some love, but it should work.
|
|
Unless --browser argument is given. Uses sane settings and a temporary
profile directory.
|
|
|
|
HTMLSerializer uses the exact string given in <meta charset=X>, so it
should be spelled with a hyphen.
|
|
|
|
|
|
Reusable browser communication and WARC writing.
|
|
Some tags are “void”, i.e. cannot contain contents and don’t have a
closing tag.
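
As an illustration, a serializer might special-case void tags like this. The `serialize` helper is hypothetical, and the tag set is an assumption based on the void elements listed in the HTML standard, so check it against the revision you target:

```python
from html import escape

# Assumed list of void elements; older HTML revisions named a few more.
VOID_TAGS = frozenset({
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
    'input', 'link', 'meta', 'source', 'track', 'wbr',
})


def serialize(tag, attrs=(), children=()):
    """Serialize one element; children are already-serialized strings."""
    attr_text = ''.join(
        f' {name}="{escape(value, quote=True)}"' for name, value in attrs
    )
    if tag in VOID_TAGS:
        # Void tags have no contents and no closing tag.
        return f'<{tag}{attr_text}>'
    return f'<{tag}{attr_text}>{"".join(children)}</{tag}>'
```

With this, `serialize('br')` yields `<br>` while `serialize('p', children=['hi'])` yields `<p>hi</p>`.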
|
|
|
|
Apparently neither width=0, height=0 nor clearDeviceMetricsOverride()
does what it should, so manually reset to 1080p screen size.
|
|
They can carry JavaScript as well and should not be allowed for DOM
snapshots.
|
|
|
|
Request all subdocuments with pierce=True, split the result and save
each document. Playback with pywb works, because timestamps of the
snapshots are close to each other.
|