Age | Commit message | Author | Files | Lines | |
---|---|---|---|---|---|
2019-05-05 | cli: Allow adding extra data to warcinfo record | Lars-Dominik Braun | 1 | -2/+5 | |
2019-03-22 | Move documentation to Sphinx | Lars-Dominik Braun | 1 | -0/+44 | |
2019-03-16 | browser: Raise exception if navigation failed | Lars-Dominik Braun | 1 | -1/+4 | |
Stop early if there’s nothing to do. | |||||
2019-03-16 | Add more debug messages | Lars-Dominik Braun | 1 | -0/+9 | |
…to controller and behavior | |||||
2019-03-08 | irc: Add config option need_voice | Lars-Dominik Braun | 1 | -0/+1 | |
Do not hardcode the privilege required to use the bot; make it configurable. | |||||
2019-01-27 | Support manhole debugging | Lars-Dominik Braun | 1 | -0/+5 | |
Add optional support for manhole to all cli tools. Activated by signal USR1. | |||||
2019-01-27 | irc: Add URL blacklist | Lars-Dominik Braun | 1 | -1/+3 | |
2019-01-27 | irc: Switch configuration to JSON | Lars-Dominik Braun | 1 | -12/+12 | |
2019-01-07 | Log Chrome’s responses to WARC by default | Lars-Dominik Braun | 1 | -1/+2 | |
We may not be able to reproduce every failure, so logging as much as possible is important to figure out what went wrong. Also, in case a bug is uncovered in the future, we can check the logs and possibly fix it with -errata. | |||||
2018-12-22 | Switch -recursive to asyncio’s .cancel() | Lars-Dominik Braun | 1 | -2/+3 | |
RecursiveController used a custom .cancel() method before. Instead we can simply cancel .run() and handle the CancelledError inside run() and fetch(). | |||||
2018-12-21 | Parse URLs by default | Lars-Dominik Braun | 1 | -2/+3 | |
Use library yarl (already pulled in by aiohttp). No URL processed should be a string. | |||||
2018-12-02 | controller: Remove unused argument | Lars-Dominik Braun | 1 | -1/+1 | |
Was replaced by handler a while ago. | |||||
2018-12-01 | cli: Fix --behavior | Lars-Dominik Braun | 1 | -2/+3 | |
2018-11-25 | single: Graceful ^C | Lars-Dominik Braun | 1 | -1/+5 | |
Allow cancellation of timeout wait. | |||||
2018-11-19 | Coding style | Lars-Dominik Braun | 1 | -8/+5 | |
Fix a few random issues pointed out by pylint, mainly unused imports. | |||||
2018-11-14 | Async chrome process startup | Lars-Dominik Braun | 1 | -3/+3 | |
Move it to .devtools. Seems more fitting. | |||||
2018-11-06 | Switch single mode to asyncio | Lars-Dominik Braun | 1 | -6/+7 | |
This is a direct port to asyncio without any design changes. These need to happen in further refinements. Fixes issue #1. | |||||
2018-10-30 | Increase default timeouts | Lars-Dominik Braun | 1 | -2/+2 | |
These are more sane than the previous super-short defaults. Obviously this will slow down recursive crawls. | |||||
2018-10-23 | single: Set and recursive: check exit status | Lars-Dominik Braun | 1 | -5/+20 | |
Use exit status to signal something is wrong. Check it within recursive, increment crashed counter and do not move the resulting WARC, since it might be broken. | |||||
2018-10-14 | irc: Add PoC dashboard | Lars-Dominik Braun | 1 | -0/+8 | |
Using websockets, vue and bulma. | |||||
2018-10-14 | irc: Graceful bot shutdown | Lars-Dominik Braun | 1 | -3/+7 | |
Wait for remaining jobs to finish without accepting new ones, but still allow some interaction with the bot (status/revoke). | |||||
2018-10-11 | recursive: Gracefully shut down on SIGINT/TERM | Lars-Dominik Braun | 1 | -1/+4 | |
2018-10-02 | irc: Refactoring/beautification | Lars-Dominik Braun | 1 | -3/+6 | |
Add logging, split bot into abstract bot implementation and actual chromebot implementation, move some reusable checks into decorators. | |||||
2018-09-29 | Add documentation | Lars-Dominik Braun | 1 | -1/+6 | |
For -recursive and -irc | |||||
2018-09-29 | irc: Limit number of processes spawned | Lars-Dominik Braun | 1 | -1/+2 | |
2018-09-29 | Add simple IRC bot | Lars-Dominik Braun | 1 | -0/+19 | |
chromebot is back! Dropping sopel, because it does not work well with asyncio. | |||||
2018-09-25 | Parallelize recursive grabs | Lars-Dominik Braun | 1 | -1/+3 | |
❤️ asyncio. | |||||
2018-09-25 | Add recursive controller | Lars-Dominik Braun | 1 | -0/+40 | |
Simple and sequential. | |||||
2018-09-25 | Log extracted links | Lars-Dominik Braun | 1 | -2/+2 | |
2018-08-21 | Remove celery and recursion | Lars-Dominik Braun | 1 | -53/+20 | |
Gonna rewrite that properly. | |||||
2018-08-04 | Reintroduce WARC logging | Lars-Dominik Braun | 1 | -8/+8 | |
Commit 7730e0d64ec895091a0dd7eb0e3c6ce2ed02d981 removed logging to WARC files. Add it again, but with a different implementation. Credits to structlog for inspiration. | |||||
2018-06-20 | Synchronous SiteLoader event handling | Lars-Dominik Braun | 1 | -6/+13 | |
Previously a browser crash stalled the entire grab, since events from pychrome were handled asynchronously in a different thread and exceptions were not propagated to the main thread. Now all browser events are stored in a queue and processed by the main thread, allowing us to handle browser crashes gracefully (more or less). This made the following additional changes necessary: (1) clear separation between producer (browser) and consumer (WARC, stats, …); (2) behavior scripts now yield events as well, instead of accessing the WARC writer; (3) WARC logging was removed (for now) and the WARC writer does not require serialization any more. | |||||
2018-05-04 | Share recursive argument parser | Lars-Dominik Braun | 1 | -7/+13 | |
2018-05-04 | Support --browser again for local crawls | Lars-Dominik Braun | 1 | -1/+5 | |
Broken by commit 75019eac4545bb2e8b90033834e91beef614cdf3 | |||||
2018-05-04 | Add distributed recursive crawls | Lars-Dominik Braun | 1 | -23/+18 | |
2018-05-04 | Add support for recursive crawls | Lars-Dominik Braun | 1 | -2/+15 | |
Only local right now, not distributed. | |||||
2018-05-04 | behavior: Add link extraction script | Lars-Dominik Braun | 1 | -2/+3 | |
2018-05-04 | Move page archiving logic to SinglePageController | Lars-Dominik Braun | 1 | -114/+21 | |
In preparation for recursive crawls. | |||||
2017-12-25 | Increase default body size | Lars-Dominik Braun | 1 | -3/+3 | |
2017-12-24 | Refactor behavior scripts | Lars-Dominik Braun | 1 | -146/+28 | |
No functional changes, just cleanup. Replaces onload and onsnapshot events. Move screen metric emulation, DOM snapshots and screenshots here as well. | |||||
2017-12-22 | Add simple stats-keeping SiteLoader | Lars-Dominik Braun | 1 | -3/+7 | |
2017-12-20 | Increase hardcoded max timeouts | Lars-Dominik Braun | 1 | -2/+2 | |
We need a better solution for this. Sites loading a lot of responsive images easily need a minute after resizing. | |||||
2017-12-19 | Serialize WARC writing | Lars-Dominik Braun | 1 | -3/+3 | |
Logger and SiteWriter both access .write_record() concurrently, which can corrupt WARC files. Move the writer to its own thread and decouple it with a queue. Since we’re probably I/O-bound this may speed up writeback as well. | |||||
2017-12-19 | Select default behavior scripts by site URL | Lars-Dominik Braun | 1 | -1/+10 | |
2017-12-17 | Add distributed archiving | Lars-Dominik Braun | 1 | -145/+206 | |
Using celery. Also adds a plugin for the IRC bot sopel. Code still needs some love, but it should work. | |||||
2017-12-06 | Start Chrome browser instance | Lars-Dominik Braun | 1 | -44/+49 | |
Unless --browser argument is given. Uses sane settings and a temporary profile directory. | |||||
2017-12-06 | Add flags to disable screenshot/DOM snapshot | Lars-Dominik Braun | 1 | -5/+9 | |
2017-12-03 | Fix UTF-8 encoding name | Lars-Dominik Braun | 1 | -1/+1 | |
HTMLSerializer uses the exact string given in <meta charset=X>, thus it should be with hyphen. | |||||
2017-12-03 | Add page screenshot to WARC | Lars-Dominik Braun | 1 | -0/+14 | |
2017-11-29 | argparse: Add metavar | Lars-Dominik Braun | 1 | -7/+7 | |
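
The 2018-12-22 entry above describes dropping a custom `.cancel()` method in favour of cancelling `.run()` and handling `CancelledError` inside it. A minimal sketch of that general asyncio pattern follows. The class `ControllerSketch`, its attributes, and the URLs are hypothetical stand-ins for illustration and are not taken from the project's code.

```python
import asyncio


class ControllerSketch:
    """Hypothetical stand-in for a controller; not the project's actual class."""

    def __init__(self):
        self.fetched = []
        self.cleaned_up = False

    async def fetch(self, url):
        # Placeholder for grabbing a single page; the sleep simulates slow I/O.
        await asyncio.sleep(1)
        self.fetched.append(url)

    async def run(self, urls):
        try:
            for url in urls:
                await self.fetch(url)
        except asyncio.CancelledError:
            # Reached when the task wrapping run() is cancelled from outside.
            # Clean up here, then re-raise so the task is marked as cancelled.
            self.cleaned_up = True
            raise


async def main():
    controller = ControllerSketch()
    task = asyncio.ensure_future(controller.run(["http://example.com/a",
                                                 "http://example.com/b"]))
    await asyncio.sleep(0.01)  # give run() a chance to start
    task.cancel()              # no custom cancel method needed
    try:
        await task
    except asyncio.CancelledError:
        pass
    return controller


controller = asyncio.run(main())
print(controller.cleaned_up, controller.fetched)
```

Re-raising `CancelledError` after cleanup keeps the task's cancelled state visible to the caller, which is why `main()` awaits the task inside its own `try`/`except`.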