|
If the browser goes idle before we enter `while True`, we never notice
and thus the idleTimeout is never applied.
|
|
Probably broken by the transition to URL() in commit
5e444dd6511d97308a84ae9c86ebf14547d01f01
And yes, we desperately need some tests for this.
|
|
Previously Item was just a simple wrapper around Chrome’s Network.*
events. This turned out to be quite nasty when testing, so its
replacement, RequestResponsePair, does some level of abstraction. This
makes testing a lot easier, since we can now simply instantiate it
without building a proper DevTools event.
Should come without any functional changes.
|
|
Replaces str.format, which is less readable due to its separation of
format and arguments.
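
A small illustration of the readability difference, assuming the replacement is Python's f-strings (the message does not name it). The variable names and values are made up:

```python
# Hypothetical values, not taken from the codebase.
url = "http://example.com/"
status = 200

# str.format: placeholders and their arguments are separated.
old = "fetched {} with status {}".format(url, status)

# f-string: each value sits exactly where it is rendered.
new = f"fetched {url} with status {status}"

assert old == new
```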
|
|
Broken by commit 5e444dd6511d97308a84ae9c86ebf14547d01f01. URLs read
from stdin must be converted from str.
|
|
RecursiveController used a custom .cancel() method before. Instead we
can simply cancel .run() and handle the CancelledError inside run() and
fetch().
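
A minimal sketch of this pattern with plain asyncio. `Controller` is a hypothetical stand-in, not the real RecursiveController, and the sleep stands in for a long-running grab:

```python
import asyncio

class Controller:
    """Hypothetical stand-in; not the real RecursiveController."""

    def __init__(self):
        self.cleaned_up = False

    async def fetch(self):
        try:
            await asyncio.sleep(3600)  # stands in for a long-running grab
        except asyncio.CancelledError:
            # Per-fetch cleanup would go here; then let the cancellation propagate.
            raise

    async def run(self):
        try:
            await self.fetch()
        except asyncio.CancelledError:
            self.cleaned_up = True
            raise  # re-raise so the caller sees the task as cancelled

async def main():
    controller = Controller()
    task = asyncio.ensure_future(controller.run())
    await asyncio.sleep(0)  # give run() a chance to start
    task.cancel()           # cancel the task itself, no custom .cancel() method
    try:
        await task
    except asyncio.CancelledError:
        pass
    return controller.cleaned_up, task.cancelled()

result = asyncio.run(main())
```

Re-raising CancelledError after cleanup keeps the task in the cancelled state, so callers can still tell cancellation apart from normal completion.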
|
|
Crash detection was moved into -recursive’s return code checking a while
ago.
|
|
Use the yarl library (already pulled in by aiohttp). No URL processed
should be a string.
|
|
|
|
In preparation for #9.
I was hoping to reuse one of schema.org’s microdata schemas, but
neither Action (archival action) nor SoftwareApplication (version
information) seems to be suitable.
|
|
|
|
Was replaced by handler a while ago.
|
|
Allow cancellation of timeout wait.
|
|
- Introduce stop() method callable from Python. Looks like the old
method (global variable) was not working (any more?). This is much
better anyway.
- Restore state of scrolled elements (not window). Fixes weird
screenshots of twitter.com.
|
|
|
|
Add parameters the grab was run with, so we can actually reproduce a
run.
|
|
Fix a few random issues pointed out by pylint, mainly unused imports.
|
|
Move it to .devtools. Seems more fitting.
|
|
This is a direct port to asyncio without any design changes. These need
to happen in further refinements.
Fixes issue #1.
|
|
This change was lost during the merge of
958563a3602780b48599c27acf212139c2e6904d.
|
|
|
|
Use the exit status to signal that something is wrong. Check it within
recursive, increment the crashed counter and do not move the resulting
WARC, since it might be broken.
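
The check could look roughly like this. It is a sketch only: `run_grab`, the `stats` keys and the file names are hypothetical, and a tiny Python child process stands in for the real grab:

```python
import subprocess
import sys

def run_grab(code):
    """Run a child process that exits with `code`; stands in for a single grab."""
    cmd = [sys.executable, "-c", f"import sys; sys.exit({code})"]
    return subprocess.run(cmd).returncode

stats = {"finished": 0, "crashed": 0}
kept_in_place = []  # WARCs that are not moved because they might be broken

for name, code in [("a.warc.gz", 0), ("b.warc.gz", 1)]:
    if run_grab(code) == 0:
        stats["finished"] += 1  # a real controller would move the WARC here
    else:
        stats["crashed"] += 1
        kept_in_place.append(name)
```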
|
|
Using websockets, vue and bulma.
|
|
|
|
No easy way to fix this, so just limit to [0, 1] for now.
|
|
For -recursive and -irc
|
|
HTTP(S) only.
|
|
❤️ asyncio.
|
|
Simple and sequential.
|
|
|
|
Gonna rewrite that properly.
|
|
Change warcinfo record format to JSON (this is permitted by the specs)
and add Python version, dependencies and their versions as well as file
hashes.
This should give us enough information to figure out the exact
environment used to create the WARC.
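
A rough sketch of building such a JSON payload. The field names, the choice of SHA-256 and the dependency entry are assumptions for illustration, not the actual record layout:

```python
import hashlib
import json
import os
import sys
import tempfile

def sha256_of(path):
    """Hex digest of a file's contents (hash algorithm is an assumption)."""
    with open(path, "rb") as fd:
        return hashlib.sha256(fd.read()).hexdigest()

def make_warcinfo(dependencies, files):
    """Serialize environment information as a JSON payload (hypothetical layout)."""
    payload = {
        "python": sys.version,
        "dependencies": dependencies,
        "files": {path: sha256_of(path) for path in files},
    }
    return json.dumps(payload).encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "behavior.js")  # made-up file to hash
    with open(script, "wb") as fd:
        fd.write(b"/* behavior script */")
    body = make_warcinfo({"somepackage": "1.0"}, [script])

info = json.loads(body)
```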
|
|
Commit 7730e0d64ec895091a0dd7eb0e3c6ce2ed02d981 removed logging to WARC
files. Add it again, but with a different implementation. Credits to
structlog for inspiration.
|
|
Judging from the docs this is the proper way to store these resources.
Enable both for the IRC bot by default, since they won’t interfere with
IA’s wayback machine.
|
|
This is mainly a quality-of-life change.
|
|
Previously a browser crash stalled the entire grab, since events from
pychrome were handled asynchronously in a different thread and
exceptions were not propagated to the main thread.
Now all browser events are stored in a queue and processed by the main
thread, allowing us to handle browser crashes gracefully (more or less).
This made the following additional changes necessary:
- Clear separation between producer (browser) and consumer (WARC, stats,
…)
- Behavior scripts now yield events as well, instead of accessing the
WARC writer
- WARC logging was removed (for now) and WARC writer does not require
serialization any more
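
The producer/consumer split described above can be sketched with the standard library alone. This does not use pychrome; `BrowserCrashed`, the event tuples and the `browser` function are hypothetical:

```python
import queue
import threading

class BrowserCrashed(Exception):
    """Hypothetical marker for a crashed browser."""

def browser(events):
    # Producer: runs in its own thread and only puts items on the queue.
    try:
        for i in range(3):
            events.put(("response", i))
        raise BrowserCrashed("tab crashed")
    except BrowserCrashed as exc:
        # Hand the failure to the main thread instead of losing it here.
        events.put(("crash", exc))

events = queue.Queue()
thread = threading.Thread(target=browser, args=(events,))
thread.start()

handled = []
crashed = False
while True:
    kind, payload = events.get()  # consumer: the main thread processes everything
    if kind == "crash":
        crashed = True
        break
    handled.append(payload)  # WARC writer, stats, ... would consume here
thread.join()
```

Because the crash arrives as an ordinary queue item, the main thread can stop the grab cleanly instead of stalling on events that never come.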
|
|
|
|
|
|
Only local right now, not distributed.
|
|
|
|
In preparation for recursive crawls.
|