Age | Commit message | Author | Files | Lines |
|
|
|
In preparation for 1.0 release:
- Correct mime types
- Add X-Crocoite-Type, so logs, scripts, dom-snapshots and screenshots
can be identified easily
- Remove random WARC headers like X-Chrome-Initiator. We don’t want to
maintain those.
- Remove non-standard urn-based package URLs. Can’t use them without a
urn-registration
|
|
Fixes #14, but needs a test case.
|
|
Chrome’s behavior with respect to screenshots changed in some version, so
artificially extending the viewport via device metrics is now required.
|
|
Fixes #18.
|
|
|
|
…to controller and behavior
|
|
load_all is deprecated. A safe YAML subset is fine for our purpose. See
https://msg.pyyaml.org/load
|
|
We may not be able to reproduce every failure, so logging as much as
possible is important to figure out what went wrong. Also, in case a bug
is uncovered in the future, we can check the logs and possibly fix it
with -errata.
|
|
Fails if the page is reloaded/redirected. See issue #13.
|
|
|
|
Using hypothesis-based test case generation. This is quite nice compared
to manual test data generation, since it catches a lot more corner cases
(if done right).
This commit also fixes a few issues, including:
- log records will only be written if the log is nonempty
- properly quote packageUrl paths
- drop old thread checking code
- use placeholder url for scripts without name
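As an illustration of the path quoting item above, here is a minimal sketch
using only the standard library. The path is hypothetical and this is not
necessarily how the commit implements it.

```python
from urllib.parse import quote

# Hypothetical script path containing a space, for illustration only.
raw_path = "/crocoite/my script.js"

# quote() percent-encodes unsafe characters; "/" is kept by default.
quoted_path = quote(raw_path)
print(quoted_path)
```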
|
|
Replaces str.format, which is less readable due to its separation of
format and arguments.
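A small sketch of the difference, with made-up values:

```python
# Hypothetical values, chosen only for illustration.
url = "http://example.com/"
status = 200

# str.format: placeholders and their arguments are written separately.
old_style = "fetched {} with status {}".format(url, status)

# f-string: each value appears exactly where it is used.
new_style = f"fetched {url} with status {status}"

# Both produce the same string.
assert old_style == new_style
```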
|
|
Use library yarl (already pulled in by aiohttp). No URL processed should
be a string.
|
|
click.js’s data was part of the script before
22adde79940d32c5f094f26f3e18b7160e7ccafc. Now it is injected
dynamically, but it still would be nice to have the data available.
|
|
|
|
|
|
Seems to be working again. Chrome bug?
|
|
|
|
|
|
First step of issue #3
|
|
|
|
- Introduce stop() method callable from Python. Looks like the old
method (global variable) was not working (any more?). This is much
better anyway.
- Restore state of scrolled elements (not window). Fixes weird
screenshots of twitter.com.
|
|
This is a direct port to asyncio without any design changes. These need
to happen in further refinements.
Fixes issue #1.
|
|
For some reason with Google Chrome 70 this is not the case any more.
|
|
Commit 7730e0d64ec895091a0dd7eb0e3c6ce2ed02d981 removed logging to WARC
files. Add it again, but with a different implementation. Credits to
structlog for inspiration.
|
|
Judging from the docs this is the proper way to store these resources.
Enable both for the IRC bot by default, since they won’t interfere with
IA’s wayback machine.
|
|
|
|
This is mainly a quality-of-life change.
|
|
Previously a browser crash stalled the entire grab, since events from
pychrome were handled asynchronously in a different thread and
exceptions were not propagated to the main thread.
Now all browser events are stored in a queue and processed by the main
thread, allowing us to handle browser crashes gracefully (more or less).
This made the following additional changes necessary:
- Clear separation between producer (browser) and consumer (WARC, stats,
…)
- Behavior scripts now yield events as well, instead of accessing the
WARC writer
- WARC logging was removed (for now) and WARC writer does not require
serialization any more
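The queue-based design described above can be sketched roughly as follows.
This is a simplified illustration using only the standard library, not the
actual implementation: BrowserCrashed, producer and consume are hypothetical
names, and the producer thread merely stands in for the browser.

```python
import queue
import threading

class BrowserCrashed(Exception):
    """Hypothetical marker for a crashed browser."""

def producer(events):
    """Stand-in for the browser thread: emits events, then crashes."""
    try:
        for i in range(3):
            events.put(("request", i))
        raise BrowserCrashed("tab went away")
    except Exception as exc:
        # Hand the exception to the consumer instead of losing it here.
        events.put(("error", exc))

def consume():
    """Main thread: processes events in order and handles the crash."""
    events = queue.Queue()
    thread = threading.Thread(target=producer, args=(events,), daemon=True)
    thread.start()
    handled = []
    while True:
        kind, payload = events.get()
        if kind == "error":
            handled.append(("crash", str(payload)))
            break
        handled.append((kind, payload))
    thread.join()
    return handled

result = consume()
```

Since every event, including the failure, passes through one queue, the
consumer sees the crash in order and can shut down cleanly instead of
stalling.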
|
|
Otherwise it may clash with symbols defined by the page.
|
|
|
|
…and not just the current viewport. Due to limitations within Chrome it
may be necessary to manually stitch multiple images if the page height
exceeds 16k pixels.
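For manual stitching, the page first has to be split into slices that stay
below the limit. A minimal sketch, assuming that "16k" means 16384 pixels;
tile_ranges is a hypothetical helper and not part of this commit.

```python
MAX_TILE_HEIGHT = 16384  # assumed value of the "16k pixels" limit

def tile_ranges(page_height, max_height=MAX_TILE_HEIGHT):
    """Return (top, height) pairs covering page_height without overlap."""
    if page_height < 0 or max_height <= 0:
        raise ValueError("page_height must be >= 0 and max_height > 0")
    ranges = []
    top = 0
    while top < page_height:
        # The last slice may be shorter than max_height.
        height = min(max_height, page_height - top)
        ranges.append((top, height))
        top += height
    return ranges

tiles = tile_ranges(40000)
```

Each pair could then be captured as a separate screenshot and the images
concatenated vertically.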
|
|
Configurable. Clicks elements matching one (or more) CSS selectors once
or multiple times.
Currently supported: Facebook, Twitter, Disqus (embedded iframe)
|
|
The “load more” button does not exist any more.
|
|
No functional changes, just cleanup. Replaces onload and onsnapshot
events. Move screen metric emulation, DOM snapshots and screenshots here
as well.
|
|
|