Age | Commit message | Author | Files | Lines | |
---|---|---|---|---|---|
2018-11-22 | controller: Improve idle waiting | Lars-Dominik Braun | 3 | -19/+89 | |
2018-11-19 | controller: Add parameters to warcinfo | Lars-Dominik Braun | 1 | -0/+7 | |
Add parameters the grab was run with, so we can actually reproduce a run. | |||||
2018-11-19 | Coding style | Lars-Dominik Braun | 12 | -58/+44 | |
Fix a few random issues pointed out by pylint, mainly unused imports. | |||||
2018-11-17 | html: Add tests for tree walker | Lars-Dominik Braun | 1 | -1/+23 | |
2018-11-17 | logger: Add more tests | Lars-Dominik Braun | 2 | -3/+25 | |
2018-11-17 | browser: Add tests for header deserialization | Lars-Dominik Braun | 1 | -0/+39 | |
2018-11-17 | devtools: Update browser flags | Lars-Dominik Braun | 1 | -0/+12 | |
Add a few more that seem reasonable. | |||||
2018-11-17 | browser: clearBrowserCookies is supported unconditionally | Lars-Dominik Braun | 1 | -4/+1 | |
canClearBrowserCookies apparently has been removed from protocol 1.3. | |||||
2018-11-17 | tools: Add original HTTP header to revisit record | Lars-Dominik Braun | 2 | -11/+13 | |
The payloads may be the same, but the headers are usually not. | |||||
2018-11-17 | click: Add gab.ai | Lars-Dominik Braun | 1 | -0/+10 | |
Load more posts on profile page and more comments and replies on individual post pages. | |||||
2018-11-14 | Async chrome process startup | Lars-Dominik Braun | 6 | -157/+161 | |
Move it to .devtools. Seems more fitting. | |||||
2018-11-10 | tools: Fix WARC merging | Lars-Dominik Braun | 2 | -18/+205 | |
WARC-Target-URI was taken from the previous record, even if the URI was different. This essentially removes the revisited URL from the archive. Also add a few tests. And boy, warcio is a mess. | |||||
2018-11-08 | devtools: Disable websocket pings to Chrome | Lars-Dominik Braun | 2 | -1/+12 | |
Chrome does not like that. | |||||
2018-11-06 | Switch single mode to asyncio | Lars-Dominik Braun | 5 | -175/+141 | |
This is a direct port to asyncio without any design changes. These need to happen in further refinements. Fixes issue #1. | |||||
2018-11-06 | Switch site loader to async DevTools communication | Lars-Dominik Braun | 2 | -229/+236 | |
2018-11-06 | Add simple asyncio-based DevTool communication | Lars-Dominik Braun | 2 | -0/+406 | |
Inspired by pychrome/aiochrome, but includes crash handling and async get() instead of callbacks. | |||||
2018-11-03 | html: Add tests for tag/attribute stripping | Lars-Dominik Braun | 1 | -0/+38 | |
2018-10-30 | recursive: Actually stop the grab when canceled | Lars-Dominik Braun | 1 | -1/+3 | |
This change was lost during the merge of 958563a3602780b48599c27acf212139c2e6904d. | |||||
2018-10-30 | Reduce idle wait time after stopping page | Lars-Dominik Braun | 1 | -4/+4 | |
2018-10-30 | Increase default timeouts | Lars-Dominik Braun | 1 | -2/+2 | |
These are more sane than the previous super-short defaults. Obviously this will slow down recursive crawls. | |||||
2018-10-23 | single: Set and recursive: check exit status | Lars-Dominik Braun | 2 | -12/+34 | |
Use the exit status to signal that something is wrong. Check it within recursive, increment the crashed counter, and do not move the resulting WARC, since it might be broken. | |||||
2018-10-22 | behavior: Unload script only if the handle is valid | Lars-Dominik Braun | 1 | -2/+4 | |
For some reason with Google Chrome 70 this is not the case any more. | |||||
2018-10-14 | irc: Add PoC dashboard | Lars-Dominik Braun | 3 | -16/+119 | |
Using websockets, vue and bulma. | |||||
2018-10-14 | irc: Graceful bot shutdown | Lars-Dominik Braun | 3 | -16/+110 | |
Wait for remaining jobs to finish without accepting new ones, but still allow some interaction with the bot (status/revoke). | |||||
2018-10-11 | recursive: Gracefully shut down on SIGINT/TERM | Lars-Dominik Braun | 2 | -4/+18 | |
2018-10-10 | Add timezone to logger dates | Lars-Dominik Braun | 1 | -1/+3 | |
UTC everywhere. Make that clear. | |||||
2018-10-03 | controller: Depth limit does not work with i>1 | Lars-Dominik Braun | 1 | -1/+3 | |
No easy way to fix this, so just limit to [0, 1] for now. | |||||
2018-10-03 | irc: Fix mode parsing | Lars-Dominik Braun | 2 | -7/+37 | |
Ignore unsupported modes, add tests. | |||||
2018-10-02 | irc: Refactoring/beautification | Lars-Dominik Braun | 2 | -101/+266 | |
Add logging, split bot into abstract bot implementation and actual chromebot implementation, move some reusable checks into decorators. | |||||
2018-09-29 | Add documentation | Lars-Dominik Braun | 2 | -3/+9 | |
For -recursive and -irc | |||||
2018-09-29 | irc: Limit number of processes spawned | Lars-Dominik Braun | 2 | -21/+25 | |
2018-09-29 | Add simple IRC bot | Lars-Dominik Braun | 2 | -0/+273 | |
chromebot is back! Dropping sopel, because it does not work well with asyncio. | |||||
2018-09-25 | Prevent recursing into arbitrary schemes | Lars-Dominik Braun | 1 | -1/+9 | |
HTTP(S) only. | |||||
2018-09-25 | Parallelize recursive grabs | Lars-Dominik Braun | 2 | -5/+17 | |
❤️ asyncio. | |||||
2018-09-25 | Add recursive controller | Lars-Dominik Braun | 2 | -1/+169 | |
Simple and sequential. | |||||
2018-09-25 | Immediately flush logger | Lars-Dominik Braun | 1 | -0/+2 | |
Consumers can read the latest gossip faster now. | |||||
2018-09-25 | Log extracted links | Lars-Dominik Braun | 2 | -2/+25 | |
2018-08-21 | Remove celery and recursion | Lars-Dominik Braun | 3 | -317/+23 | |
Gonna rewrite that properly. | |||||
2018-08-17 | behavior: Load more comments from Facebook | Lars-Dominik Braun | 1 | -0/+4 | |
2018-08-05 | test_browser: Properly handle failed requests | Lars-Dominik Braun | 2 | -15/+14 | |
Fixes test failures. Very fragile code unfortunately. | |||||
2018-08-04 | Properly handle failure to retrieve request body | Lars-Dominik Braun | 3 | -5/+50 | |
Just truncate the WARC record like we do with responses. Also add a few tests, but they’re not covering the call to getRequestPostData. Not sure what we have to do here. | |||||
2018-08-04 | Reference warcinfo record in every other record | Lars-Dominik Braun | 1 | -18/+30 | |
2018-08-04 | Add package information to warcinfo | Lars-Dominik Braun | 3 | -8/+65 | |
Change warcinfo record format to JSON (this is permitted by the specs) and add Python version, dependencies and their versions as well as file hashes. This should give us enough information to figure out the exact environment used to create the WARC. | |||||
2018-08-04 | Reintroduce WARC logging | Lars-Dominik Braun | 9 | -76/+337 | |
Commit 7730e0d64ec895091a0dd7eb0e3c6ce2ed02d981 removed logging to WARC files. Add it again, but with a different implementation. Credits to structlog for inspiration. | |||||
2018-06-25 | browser: Fix testcase race condition | Lars-Dominik Braun | 1 | -0/+4 | |
2018-06-25 | warc: Add metadata to truncated records | Lars-Dominik Braun | 1 | -22/+28 | |
Specifically for a) redirects (body missing), b) bodies larger than the size limit, and c) whenever we couldn’t fetch the response body for whatever reason. We gave it our best shot, but still failed miserably. Future generations will certainly appreciate that. Eh, maybe. Hopefully. Will they? | |||||
2018-06-25 | warc: Save DOM-/image screenshot as WARC conversion | Lars-Dominik Braun | 6 | -37/+72 | |
Judging from the docs this is the proper way to store these resources. Enable both for the IRC bot by default, since they won’t interfere with IA’s wayback machine. | |||||
2018-06-21 | Fix a few issues pointed out by pylint | Lars-Dominik Braun | 5 | -22/+10 | |
2018-06-21 | browser: Add a few more tests | Lars-Dominik Braun | 1 | -3/+31 | |
Increase coverage. | |||||
2018-06-20 | Move tests to pytest | Lars-Dominik Braun | 2 | -162/+177 | |
It just seems a little nicer than plain old unittest |