Age | Commit message | Author | Files | Lines | |
---|---|---|---|---|---|
2019-03-16 | browser: Raise exception if navigation failed | Lars-Dominik Braun | 1 | -0/+5 | |
Stop early if there’s nothing to do. | |||||
2019-03-16 | browser: Use different UUID for loadingFinished/Failed | Lars-Dominik Braun | 1 | -1/+1 | |
2019-01-07 | Log Chrome’s responses to WARC by default | Lars-Dominik Braun | 1 | -9/+21 | |
We may not be able to reproduce every failure, so logging as much as possible is important to figure out what went wrong. Also, in case a bug is uncovered in the future, we can check the logs and possibly fix it with -errata. | |||||
2019-01-05 | browser: Do not overwrite request data when prefetching | Lars-Dominik Braun | 1 | -2/+0 | |
Needs a testcase. | |||||
2019-01-03 | browser: Turn Item into RequestResponsePair | Lars-Dominik Braun | 1 | -111/+215 | |
Previously Item was just a simple wrapper around Chrome’s Network.* events. This turned out to be quite nasty when testing, so its replacement, RequestResponsePair, does some level of abstraction. This makes testing a lot easier, since we can now simply instantiate it without building a proper DevTools event. Should come without any functional changes. | |||||
2018-12-24 | Use f-strings where possible | Lars-Dominik Braun | 1 | -1/+1 | |
Replaces str.format, which is less readable due to its separation of format and arguments. | |||||
2018-12-21 | Parse URLs by default | Lars-Dominik Braun | 1 | -17/+11 | |
Use library yarl (already pulled in by aiohttp). No URL processed should be a string. | |||||
2018-11-24 | browser: Ignore load failures for nonexisting requests | Lars-Dominik Braun | 1 | -2/+3 | |
Fixes None dereference. | |||||
2018-11-22 | controller: Improve idle waiting | Lars-Dominik Braun | 1 | -2/+38 | |
2018-11-19 | Coding style | Lars-Dominik Braun | 1 | -4/+2 | |
Fix a few random issues pointed out by pylint, mainly unused imports. | |||||
2018-11-17 | browser: clearBrowserCookies is supported unconditionally | Lars-Dominik Braun | 1 | -4/+1 | |
canClearBrowserCookies apparently has been removed from protocol 1.3. | |||||
2018-11-14 | Async chrome process startup | Lars-Dominik Braun | 1 | -72/+0 | |
Move it to .devtools. Seems more fitting. | |||||
2018-11-06 | Switch site loader to async DevTools communication | Lars-Dominik Braun | 1 | -107/+126 | |
2018-08-05 | test_browser: Properly handle failed requests | Lars-Dominik Braun | 1 | -5/+4 | |
Fixes test failures. Very fragile code unfortunately. | |||||
2018-08-04 | Properly handle failure to retrieve request body | Lars-Dominik Braun | 1 | -1/+3 | |
Just truncate the WARC record like we do with responses. Also add a few tests, but they’re not covering the call to getRequestPostData. Not sure what we have to do here. | |||||
2018-08-04 | Reintroduce WARC logging | Lars-Dominik Braun | 1 | -18/+30 | |
Commit 7730e0d64ec895091a0dd7eb0e3c6ce2ed02d981 removed logging to WARC files. Add it again, but with a different implementation. Credits to structlog for inspiration. | |||||
2018-06-25 | warc: Save DOM-/image screenshot as WARC conversion | Lars-Dominik Braun | 1 | -3/+7 | |
Judging from the docs this is the proper way to store these resources. Enable both for the IRC bot by default, since they won’t interfere with IA’s wayback machine. | |||||
2018-06-21 | Fix a few issues pointed out by pylint | Lars-Dominik Braun | 1 | -6/+3 | |
2018-06-20 | Move tests to pytest | Lars-Dominik Braun | 1 | -162/+0 | |
It just seems a little nicer than plain old unittest | |||||
2018-06-20 | Add __slots__ to classes | Lars-Dominik Braun | 1 | -0/+11 | |
This is mainly a quality of life change | |||||
2018-06-20 | Synchronous SiteLoader event handling | Lars-Dominik Braun | 1 | -252/+165 | |
Previously a browser crash stalled the entire grab, since events from pychrome were handled asynchronously in a different thread and exceptions were not propagated to the main thread. Now all browser events are stored in a queue and processed by the main thread, allowing us to handle browser crashes gracefully (more or less). This made the following additional changes necessary: clear separation between producer (browser) and consumer (WARC, stats, …); behavior scripts now yield events as well, instead of accessing the WARC writer; WARC logging was removed (for now) and the WARC writer does not require serialization any more. | |||||
2018-06-08 | browser: Replace --remote-debugging-socket-fd | Lars-Dominik Braun | 1 | -23/+19 | |
It was replaced by --remote-debugging-pipe in version 67. pychrome does not support that out of the box, so instead we’ll let Chrome choose its own port and poll a file in its user-data-dir. | |||||
2018-05-04 | Support --browser again for local crawls | Lars-Dominik Braun | 1 | -1/+1 | |
Broken by commit 75019eac4545bb2e8b90033834e91beef614cdf3 | |||||
2018-05-04 | browser: Replace context manager decorator | Lars-Dominik Braun | 1 | -51/+66 | |
Use an actual class that supports multiple invocations. | |||||
2018-05-04 | Move header unfolding into Item | Lars-Dominik Braun | 1 | -0/+22 | |
2018-05-04 | Fetch request POST body | Lars-Dominik Braun | 1 | -1/+15 | |
If there is any and it was not included in the response already. | |||||
2018-05-04 | Test chained redirects | Lars-Dominik Braun | 1 | -12/+32 | |
2018-04-14 | Fix base64 body detection | Lars-Dominik Braun | 1 | -9/+9 | |
Broken by commit a21d7332e33a3e47a363004196451721d449e70b | |||||
2018-04-14 | Add timeout to request body fetch | Lars-Dominik Braun | 1 | -3/+4 | |
When something goes wrong, these block the entire grab. | |||||
2018-04-14 | Handle JavaScript dialogs | Lars-Dominik Braun | 1 | -2/+37 | |
alert, confirm, prompt and beforeunload | |||||
2018-03-25 | Add a few simple tests | Lars-Dominik Braun | 1 | -0/+190 | |
To be expanded, but it’s a start… | |||||
2018-03-25 | Replace deprecated logger.warn | Lars-Dominik Braun | 1 | -3/+3 | |
2018-03-25 | ChromeService: Close listening socket | Lars-Dominik Braun | 1 | -0/+1 | |
We passed it to the child and don’t need it any more. | |||||
2018-03-25 | Move getResponseBody call to Item wrapper | Lars-Dominik Braun | 1 | -2/+19 | |
2018-03-18 | browser: Don’t overwrite LogEntry’s args | Lars-Dominik Braun | 1 | -1/+1 | |
2017-12-27 | Log messages from browser console | Lars-Dominik Braun | 1 | -0/+12 | |
2017-12-23 | Set fake finished response for redirects | Lars-Dominik Braun | 1 | -1/+4 | |
Fixes bcfbdd9b45b7e872ee77e1366197443d855d8c7c | |||||
2017-12-23 | Drain tab event queue before stopping | Lars-Dominik Braun | 1 | -0/+2 | |
2017-12-22 | Add simple stats-keeping SiteLoader | Lars-Dominik Braun | 1 | -2/+33 | |
2017-12-22 | SiteLoader: Save entire finished response | Lars-Dominik Braun | 1 | -2/+9 | |
2017-12-17 | Add distributed archiving | Lars-Dominik Braun | 1 | -6/+15 | |
Using celery. Also adds a plugin for the IRC bot sopel. Code still needs some love, but it should work. | |||||
2017-12-06 | Start Chrome browser instance | Lars-Dominik Braun | 1 | -0/+52 | |
Unless --browser argument is given. Uses sane settings and a temporary profile directory. | |||||
2017-11-29 | Add missing timestamp to response data for redirects | Lars-Dominik Braun | 1 | -1/+1 | |
Fixes 6f628ca24ac2b243dd4a611ff1ecff2d35aaa019 | |||||
2017-11-29 | Use Chrome’s timestamps as WARC-Date | Lars-Dominik Braun | 1 | -8/+8 | |
2017-11-29 | Refactoring | Lars-Dominik Braun | 1 | -0/+209 | |
Reusable browser communication and WARC writing. |
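
The 2018-06-20 entry "Synchronous SiteLoader event handling" describes moving browser events into a queue that the main thread drains, so that a failure on the browser side is no longer lost in a background thread. The following is a minimal sketch of that general producer/consumer pattern, not the project's actual code. The names `producer`, `consume` and `BrowserCrashed` are hypothetical, and a plain thread stands in for the browser connection.

```python
import queue
import threading


class BrowserCrashed(Exception):
    """Hypothetical error type: the producer side failed."""


def producer(events, items, crash_after=None):
    """Stand-in for the browser thread: it only forwards events."""
    try:
        for index, item in enumerate(items):
            if crash_after is not None and index == crash_after:
                raise RuntimeError("browser went away")
            events.put(("event", item))
        events.put(("done", None))
    except Exception as exc:
        # Hand the failure to the consumer instead of dying silently.
        events.put(("error", exc))


def consume(items, crash_after=None, timeout=5.0):
    """Start the producer and handle all of its events in the calling thread."""
    events = queue.Queue()
    worker = threading.Thread(
        target=producer, args=(events, items, crash_after), daemon=True
    )
    worker.start()
    handled = []
    while True:
        # Raises queue.Empty if the producer stalls for longer than timeout.
        kind, payload = events.get(timeout=timeout)
        if kind == "event":
            handled.append(payload)
        elif kind == "done":
            return handled
        else:
            # The error surfaces in the consumer's thread, where it can be handled.
            raise BrowserCrashed(str(payload)) from payload
```

Calling `consume(["a", "b"])` returns the handled events in order, while passing `crash_after=1` raises `BrowserCrashed` in the caller. The point of the design is that all decisions (writing records, keeping stats, aborting) happen in one thread.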