Age | Commit message | Author | Files | Lines |
|
Move it to .devtools. Seems more fitting.
|
|
WARC-Target-URI was taken from the previous record, even if the URI was
different. This essentially removes the revisited URL from the archive.
Also add a few tests. And boy, warcio is a mess.
|
|
Chrome does not like that.
|
|
This is a direct port to asyncio without any design changes. These need
to happen in further refinements.
Fixes issue #1.
|
|
|
|
Inspired by pychrome/aiochrome, but includes crash handling and async
get() instead of callbacks.
|
|
|
|
This change was lost during the merge of
958563a3602780b48599c27acf212139c2e6904d.
|
|
|
|
These are more sane than the previous super-short defaults. Obviously
this will slow down recursive crawls.
|
|
Use exit status to signal something is wrong. Check it within recursive,
increment crashed counter and do not move the resulting WARC, it might
be broken.
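
A minimal sketch of the exit-status check described above, not the actual implementation. The function name `run_grab`, the `stats` dictionary and its keys are hypothetical; only the behaviour (count the crash, leave the WARC where it is) is taken from the message.

```python
import subprocess
from pathlib import Path


def run_grab(command, temp_warc, final_warc, stats):
    """Run one grab process and keep its WARC only if it exited cleanly.

    `command` is an argument list for the (hypothetical) grab process,
    `stats` a plain dict with "crashed" and "finished" counters.
    """
    result = subprocess.run(command)
    if result.returncode != 0:
        # Non-zero exit status: something went wrong. Count it and do not
        # move the WARC, since it might be broken.
        stats["crashed"] += 1
        return False
    # Clean exit: move the WARC to its final location.
    Path(temp_warc).rename(final_warc)
    stats["finished"] += 1
    return True
```

The caller decides what to do with a WARC that was left behind; this sketch only guarantees it never lands next to the good ones.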
|
|
For some reason with Google Chrome 70 this is not the case any more.
|
|
Using websockets, vue and bulma.
|
|
Wait for remaining jobs to finish without accepting new ones, but still
allow some interaction with the bot (status/revoke).
|
|
|
|
UTC everywhere. Make that clear.
|
|
No easy way to fix this, so just limit to [0, 1] for now.
|
|
Ignore unsupported modes, add tests.
|
|
Add logging, split bot into abstract bot implementation and actual
chromebot implementation, move some reusable checks into decorators.
|
|
For -recursive and -irc
|
|
|
|
chromebot is back! Dropping sopel, because it does not work well with
asyncio.
|
|
HTTP(S) only.
|
|
❤️ asyncio.
|
|
Simple and sequential.
|
|
Consumers can read the latest gossip faster now.
|
|
|
|
Gonna rewrite that properly.
|
|
|
|
Fixes test failures. Very fragile code unfortunately.
|
|
Just truncate the WARC record like we do with responses. Also add a few
tests, but they’re not covering the call to getRequestPostData. Not sure
what we have to do here.
|
|
|
|
Change warcinfo record format to JSON (this is permitted by the specs)
and add Python version, dependencies and their versions as well as file
hashes.
This should give us enough information to figure out the exact
environment used to create the WARC.
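
As a rough illustration of what such a JSON payload could contain, assuming nothing about the real field names: `build_warcinfo_payload` and the keys `python`, `dependencies` and `files` are made up for this sketch, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib
import json
import sys
from importlib import metadata


def build_warcinfo_payload(packages, files):
    """Return environment details as UTF-8 encoded JSON.

    `packages` is a list of distribution names to look up, `files` a list
    of paths whose contents should be hashed.
    """
    dependencies = []
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            version = None  # not installed; record that explicitly
        dependencies.append({"name": name, "version": version})

    hashes = {}
    for path in files:
        with open(path, "rb") as fd:
            hashes[path] = hashlib.sha256(fd.read()).hexdigest()

    payload = {
        "python": {
            "version": sys.version,
            "implementation": sys.implementation.name,
        },
        "dependencies": dependencies,
        "files": hashes,
    }
    return json.dumps(payload, sort_keys=True).encode("utf-8")
```

The resulting bytes would then be written as the body of the warcinfo record by whatever WARC writer is in use.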
|
|
Commit 7730e0d64ec895091a0dd7eb0e3c6ce2ed02d981 removed logging to WARC
files. Add it again, but with a different implementation. Credits to
structlog for inspiration.
|
|
|
|
Specifically for a) redirects (body missing), b) bodies larger than the
size limit and c) whenever we couldn’t fetch the response body for
whatever reason.
We gave it our best shot, but still failed miserably. Future generations
will certainly appreciate that. Eh, maybe. Hopefully. Will they?
|
|
Judging from the docs this is the proper way to store these resources.
Enable both for the IRC bot by default, since they won’t interfere with
IA’s wayback machine.
|
|
|
|
Increase coverage.
|
|
It just seems a little nicer than plain old unittest.
|
|
This is mainly a quality of life change.
|
|
Previously a browser crash stalled the entire grab, since events from
pychrome were handled asynchronously in a different thread and
exceptions were not propagated to the main thread.
Now all browser events are stored in a queue and processed by the main
thread, allowing us to handle browser crashes gracefully (more or less).
This made the following additional changes necessary:
- Clear separation between producer (browser) and consumer (WARC, stats,
  …)
- Behavior scripts now yield events as well, instead of accessing the
  WARC writer
- WARC logging was removed (for now) and WARC writer does not require
  serialization any more
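
The queue-based design above can be sketched roughly as follows. This is an illustration using the standard library only, not the project's code: `BrowserCrashed`, `producer`, `consume` and `run`, as well as the `"crash"` marker event, are all hypothetical.

```python
import queue
import threading


class BrowserCrashed(Exception):
    """Hypothetical exception standing in for a browser crash."""


def producer(events, out):
    """Runs in a worker thread and forwards events into the queue.

    Exceptions are put into the queue too, so they are not lost in the
    worker thread.
    """
    try:
        for event in events:
            if event == "crash":  # stand-in for a real crash notification
                raise BrowserCrashed("browser went away")
            out.put(event)
    except Exception as exc:
        out.put(exc)
    finally:
        out.put(None)  # sentinel: no more events


def consume(out):
    """Runs in the main thread; re-raises anything the producer hit."""
    handled = []
    while True:
        item = out.get()
        if item is None:
            return handled
        if isinstance(item, Exception):
            raise item
        handled.append(item)


def run(events):
    out = queue.Queue()
    worker = threading.Thread(target=producer, args=(events, out), daemon=True)
    worker.start()
    try:
        return consume(out)
    finally:
        worker.join()
```

Because the exception travels through the same queue as ordinary events, the main thread sees it in order and can stop the grab instead of stalling.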
|
|
It was replaced by --remote-debugging-pipe in version 67. pychrome does
not support that out of the box, so instead we’ll let Chrome choose its
own port and poll a file in its user-data-dir.
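
A sketch of the polling step, under the assumption that the file in question is `DevToolsActivePort`, which Chrome writes into its user data directory when started with `--remote-debugging-port=0`, with the chosen port on the first line. The function name, timeout and interval are illustrative.

```python
import time
from pathlib import Path


def wait_for_devtools_port(user_data_dir, timeout=10.0, interval=0.1):
    """Poll the user data directory until Chrome has written its port.

    Assumes a file named DevToolsActivePort whose first line is the port.
    Raises TimeoutError if no usable file shows up in time.
    """
    port_file = Path(user_data_dir) / "DevToolsActivePort"
    deadline = time.monotonic() + timeout
    while True:
        try:
            lines = port_file.read_text().splitlines()
            # The file may exist but still be empty while Chrome writes it.
            if lines and lines[0].strip().isdigit():
                return int(lines[0])
        except FileNotFoundError:
            pass  # Chrome has not created the file yet
        if time.monotonic() >= deadline:
            raise TimeoutError("no DevTools port found in " + str(port_file))
        time.sleep(interval)
```

The returned port can then be handed to the DevTools client in place of a fixed port number.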
|
|
Otherwise it may clash with symbols defined by the page.
|
|
|
|
|
|
Move contrib/ scripts to .tools and add entry points to setup.py, rename
crocoite-standalone to crocoite-grab.
|
|
|
|
|
|
Broken by commit 75019eac4545bb2e8b90033834e91beef614cdf3
|