Age | Commit message | Author | Files | Lines |
|
Previously Item was just a thin wrapper around Chrome’s Network.*
events. This turned out to be quite nasty when testing, so its
replacement, RequestResponsePair, adds a layer of abstraction. This
makes testing a lot easier, since we can now simply instantiate it
without building a proper DevTools event.
Should come without any functional changes.
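A minimal sketch of the idea, not the actual class: the field names and the from_request_event helper are assumptions made for illustration. Only the requestId/request layout of Network.requestWillBeSent is taken from the DevTools protocol.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Request:
    url: str
    method: str = "GET"
    headers: dict = field(default_factory=dict)


@dataclass
class Response:
    status: int
    headers: dict = field(default_factory=dict)
    body: Optional[bytes] = None


@dataclass
class RequestResponsePair:
    """Hypothetical stand-in; the real class may look different."""
    id: str
    request: Optional[Request] = None
    response: Optional[Response] = None

    @classmethod
    def from_request_event(cls, event: dict) -> "RequestResponsePair":
        # Translate a Network.requestWillBeSent-style payload once,
        # at the boundary, so nothing else depends on the event layout.
        req = event["request"]
        return cls(
            id=event["requestId"],
            request=Request(
                url=req["url"],
                method=req.get("method", "GET"),
                headers=dict(req.get("headers", {})),
            ),
        )


# In a test the pair can be built directly, without any DevTools event.
pair = RequestResponsePair(
    id="1",
    request=Request("http://example.com/"),
    response=Response(200, {"Content-Type": "text/html"}, b"<html></html>"),
)
```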
|
|
Use hypothesis-based test case generation. This is quite nice compared
to manual test data generation, since it catches a lot more corner cases
(if done right).
This commit also fixes a few issues, including:
- log records will only be written if the log is nonempty
- properly quote packageUrl paths
- drop old thread checking code
- use placeholder url for scripts without name
|
|
Replaces str.format, which is less readable due to its separation of
format and arguments.
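The commit subject is missing here. Assuming the replacement is Python's f-strings, the readability difference looks like this (the variable names are made up):

```python
url = "http://example.com/"
status = 200

# str.format: the placeholders and the values that fill them are apart.
old = "fetched {} with status {}".format(url, status)

# f-string: each value sits exactly where it is rendered.
new = f"fetched {url} with status {status}"
```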
|
|
Use the yarl library (already pulled in by aiohttp). No processed URL
should be a plain string.
|
|
Fix a few random issues pointed out by pylint, mainly unused imports.
|
|
This is a direct port to asyncio without any design changes. These need
to happen in further refinements.
Fixes issue #1.
|
|
Just truncate the WARC record like we do with responses. Also add a few
tests, but they do not cover the call to getRequestPostData. Not sure
what we have to do here.
|
|
Change warcinfo record format to JSON (this is permitted by the specs)
and add Python version, dependencies and their versions as well as file
hashes.
This should give us enough information to figure out the exact
environment used to create the WARC.
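A rough sketch of how such a payload could be assembled with the standard library only. The field names, the choice of SHA-512 and the function names are assumptions, not the format this commit actually writes.

```python
import hashlib
import json
import platform
from importlib import metadata


def file_hash(path):
    """SHA-512 of a file, read in chunks. The algorithm is an assumption."""
    digest = hashlib.sha512()
    with open(path, "rb") as fd:
        for chunk in iter(lambda: fd.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dependency_versions(names):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def warcinfo_payload(software, dependencies=(), files=()):
    """Build a JSON body for a warcinfo record (hypothetical field names)."""
    info = {
        "software": software,
        "python": {
            "implementation": platform.python_implementation(),
            "version": platform.python_version(),
        },
        "dependencies": dependency_versions(dependencies),
        "files": {path: file_hash(path) for path in files},
    }
    return json.dumps(info, sort_keys=True).encode("utf-8")
```

Passing the dependency names explicitly keeps the record small and avoids walking every installed distribution.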
|
|
Commit 7730e0d64ec895091a0dd7eb0e3c6ce2ed02d981 removed logging to WARC
files. Add it again, but with a different implementation. Credits to
structlog for inspiration.
|
|
Specifically for a) redirects (body missing), b) bodies larger than the
size limit, and c) whenever we couldn’t fetch the response body for
whatever reason.
We gave it our best shot, but still failed miserably. Future generations
will certainly appreciate that. Eh, maybe. Hopefully. Will they?
|
|
Judging from the docs this is the proper way to store these resources.
Enable both for the IRC bot by default, since they won’t interfere with
IA’s Wayback Machine.
|
|
This is mainly a quality-of-life change.
|
|
Previously a browser crash stalled the entire grab, since events from
pychrome were handled asynchronously in a different thread and
exceptions were not propagated to the main thread.
Now all browser events are stored in a queue and processed by the main
thread, allowing us to handle browser crashes gracefully (more or less).
This made the following additional changes necessary:
- Clear separation between producer (browser) and consumer (WARC, stats,
…)
- Behavior scripts now yield events as well, instead of accessing the
WARC writer
- WARC logging was removed (for now) and WARC writer does not require
serialization any more
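The queue-based design described above can be sketched roughly as follows. This uses the standard library only; BrowserCrashed, producer and consume are names invented for this sketch, and the string events stand in for real browser events.

```python
import queue
import threading


class BrowserCrashed(Exception):
    """Hypothetical exception type for this sketch."""


def producer(events, out):
    # Runs in the browser callback thread. It never handles events
    # itself; it only forwards them (or the failure) to the queue.
    try:
        for ev in events:
            if ev == "crash":
                raise BrowserCrashed("browser went away")
            out.put(ev)
    except Exception as exc:
        out.put(exc)
    finally:
        out.put(None)  # sentinel: no more events


def consume(out):
    # Runs in the main thread. Exceptions from the producer are
    # re-raised here, where they can actually be handled.
    while True:
        item = out.get()
        if item is None:
            return
        if isinstance(item, Exception):
            raise item
        yield item


events = queue.Queue()
thread = threading.Thread(
    target=producer, args=(["request", "response", "crash", "late"], events))
thread.start()

seen = []
crashed = False
try:
    for ev in consume(events):
        seen.append(ev)  # a real consumer would write WARC records or update stats
except BrowserCrashed:
    crashed = True
thread.join()
```

Because the failure travels through the same queue as the events, the main thread sees it in order and can shut down instead of stalling.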
|
|
In preparation for recursive crawls.
|
|
If there is any and it was not included in the response already.
|
|
Broken by commit a21d7332e33a3e47a363004196451721d449e70b
|
|
No functional changes, just cleanup. Replaces onload and onsnapshot
events. Move screen metric emulation, DOM snapshots and screenshots here
as well.
|
|
+refactoring.
|
|
This is an undocumented DevTools feature.
|
|
Logger and SiteWriter both access .write_record() concurrently, which
can corrupt WARC files. Move the writer to its own thread and decouple
it with a queue. Since we’re probably I/O-bound this may speed up
writeback as well.
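A minimal sketch of the decoupling, assuming a writer that only needs a file-like object. QueuedWriter is a made-up name; the real writer works on WARC records rather than raw bytes.

```python
import io
import queue
import threading


class QueuedWriter:
    """Serialize writes from many threads through one writer thread."""

    def __init__(self, fd):
        self._fd = fd
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def write_record(self, record: bytes):
        # Safe to call from any thread: this only enqueues.
        self._queue.put(record)

    def _run(self):
        # The only place that touches the file, so records never interleave.
        while True:
            record = self._queue.get()
            if record is None:
                break
            self._fd.write(record)

    def close(self):
        # Flush everything queued so far, then stop the writer thread.
        self._queue.put(None)
        self._thread.join()


output = io.BytesIO()
writer = QueuedWriter(output)


def emit(prefix):
    for i in range(100):
        writer.write_record(f"{prefix}-{i}\n".encode("ascii"))


# Two callers stand in for Logger and SiteWriter.
callers = [threading.Thread(target=emit, args=(name,)) for name in ("log", "site")]
for t in callers:
    t.start()
for t in callers:
    t.join()
writer.close()
```

Callers return as soon as the record is queued, which is where the possible writeback speedup would come from.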
|
|
We can’t do that safely due to a race condition.
|
|
Reusable browser communication and WARC writing.
|