Age | Commit message | Author | Files | Lines | |
---|---|---|---|---|---|
2018-09-25 | Log extracted links | Lars-Dominik Braun | 2 | -2/+25 | |
2018-08-21 | Remove celery and recursion | Lars-Dominik Braun | 3 | -317/+23 | |
Gonna rewrite that properly. | |||||
2018-08-17 | behavior: Load more comments from Facebook | Lars-Dominik Braun | 1 | -0/+4 | |
2018-08-05 | test_browser: Properly handle failed requests | Lars-Dominik Braun | 2 | -15/+14 | |
Fixes test failures. Very fragile code unfortunately. | |||||
2018-08-04 | Properly handle failure to retrieve request body | Lars-Dominik Braun | 3 | -5/+50 | |
Just truncate the WARC record like we do with responses. Also add a few tests, but they’re not covering the call to getRequestPostData. Not sure what we have to do here. | |||||
2018-08-04 | Reference warcinfo record in every other record | Lars-Dominik Braun | 1 | -18/+30 | |
2018-08-04 | Add package information to warcinfo | Lars-Dominik Braun | 3 | -8/+65 | |
Change warcinfo record format to JSON (this is permitted by the specs) and add Python version, dependencies and their versions as well as file hashes. This should give us enough information to figure out the exact environment used to create the WARC. | |||||
2018-08-04 | Reintroduce WARC logging | Lars-Dominik Braun | 9 | -76/+337 | |
Commit 7730e0d64ec895091a0dd7eb0e3c6ce2ed02d981 removed logging to WARC files. Add it again, but with a different implementation. Credits to structlog for inspiration. | |||||
2018-06-25 | browser: Fix testcase race condition | Lars-Dominik Braun | 1 | -0/+4 | |
2018-06-25 | warc: Add metadata to truncated records | Lars-Dominik Braun | 1 | -22/+28 | |
Specifically for a) redirects (body missing) b) bodies larger than size limit and c) whenever we couldn’t fetch the response body for whatever reason. We gave it our best shot, but still failed miserably. Future generations will certainly appreciate that. Eh, maybe. Hopefully. Will they? | |||||
2018-06-25 | warc: Save DOM-/image screenshot as WARC conversion | Lars-Dominik Braun | 6 | -37/+72 | |
Judging from the docs this is the proper way to store these resources. Enable both for the IRC bot by default, since they won’t interfere with IA’s wayback machine. | |||||
2018-06-21 | Fix a few issues pointed out by pylint | Lars-Dominik Braun | 5 | -22/+10 | |
2018-06-21 | browser: Add a few more tests | Lars-Dominik Braun | 1 | -3/+31 | |
Increase coverage. | |||||
2018-06-20 | Move tests to pytest | Lars-Dominik Braun | 2 | -162/+177 | |
It just seems a little nicer than plain old unittest | |||||
2018-06-20 | Add __slots__ to classes | Lars-Dominik Braun | 5 | -1/+56 | |
This is mainly a quality of life change | |||||
2018-06-20 | Synchronous SiteLoader event handling | Lars-Dominik Braun | 6 | -509/+514 | |
Previously a browser crash stalled the entire grab, since events from pychrome were handled asynchronously in a different thread and exceptions were not propagated to the main thread. Now all browser events are stored in a queue and processed by the main thread, allowing us to handle browser crashes gracefully (more or less). This made the following additional changes necessary: (1) clear separation between producer (browser) and consumer (WARC, stats, …); (2) behavior scripts now yield events as well, instead of accessing the WARC writer; (3) WARC logging was removed (for now) and the WARC writer does not require serialization any more. | |||||
2018-06-08 | browser: Replace --remote-debugging-socket-fd | Lars-Dominik Braun | 1 | -23/+19 | |
The flag was replaced by --remote-debugging-pipe in Chrome 67. pychrome does not support that out of the box, so instead we’ll let Chrome choose its own port and poll a file in its user-data-dir. | |||||
2018-06-03 | behavior: Wrap extract links script in anonymous namespace | Lars-Dominik Braun | 2 | -2/+5 | |
Otherwise it may clash with symbols defined by the page. | |||||
2018-05-20 | behavior: Patreon: Load more comments/replies | Lars-Dominik Braun | 1 | -0/+4 | |
2018-05-20 | behavior: Click Patreon’s “load more” button | Lars-Dominik Braun | 1 | -0/+6 | |
2018-05-05 | Rename command line tools | Lars-Dominik Braun | 1 | -0/+97 | |
Move contrib/ scripts to .tools and add entry points to setup.py, rename crocoite-standalone to crocoite-grab. | |||||
2018-05-05 | Extract only visible and clickable links | Lars-Dominik Braun | 2 | -4/+29 | |
2018-05-04 | Share recursive argument parser | Lars-Dominik Braun | 2 | -14/+15 | |
2018-05-04 | Support --browser again for local crawls | Lars-Dominik Braun | 2 | -2/+6 | |
Broken by commit 75019eac4545bb2e8b90033834e91beef614cdf3 | |||||
2018-05-04 | Add distributed recursive crawls | Lars-Dominik Braun | 3 | -31/+91 | |
2018-05-04 | Add support for recursive crawls | Lars-Dominik Braun | 2 | -2/+115 | |
Only local right now, not distributed. | |||||
2018-05-04 | browser: Replace context manager decorator | Lars-Dominik Braun | 1 | -51/+66 | |
Use an actual class that supports multiple invocations. | |||||
2018-05-04 | behavior: Add link extraction script | Lars-Dominik Braun | 4 | -5/+43 | |
2018-05-04 | Move page archiving logic to SinglePageController | Lars-Dominik Braun | 5 | -144/+198 | |
In preparation for recursive crawls. | |||||
2018-05-04 | Move header unfolding into Item | Lars-Dominik Braun | 2 | -21/+24 | |
2018-05-04 | Fetch request POST body | Lars-Dominik Braun | 2 | -8/+20 | |
If there is any and it was not included in the response already. | |||||
2018-05-04 | Test chained redirects | Lars-Dominik Braun | 1 | -12/+32 | |
2018-04-20 | Save screenshot of entire page | Lars-Dominik Braun | 1 | -6/+16 | |
…and not just the current viewport. Due to limitations within Chrome it may be necessary to manually stitch multiple images if the page height exceeds 16k pixels. | |||||
2018-04-14 | Fix base64 body detection | Lars-Dominik Braun | 2 | -10/+10 | |
Broken by commit a21d7332e33a3e47a363004196451721d449e70b | |||||
2018-04-14 | Add timeout to request body fetch | Lars-Dominik Braun | 1 | -3/+4 | |
When something goes wrong, these block the entire grab. | |||||
2018-04-14 | Handle JavaScript dialogs | Lars-Dominik Braun | 1 | -2/+37 | |
alert, confirm, prompt and beforeunload | |||||
2018-04-04 | behavior: Add selector for YouTube. | Lars-Dominik Braun | 1 | -0/+6 | |
2018-03-30 | Add click selectors for Instagram | Lars-Dominik Braun | 1 | -0/+8 | |
Load more comments/images for posts. | |||||
2018-03-25 | Add a few simple tests | Lars-Dominik Braun | 1 | -0/+190 | |
To be expanded, but it’s a start… | |||||
2018-03-25 | Replace deprecated logger.warn | Lars-Dominik Braun | 1 | -3/+3 | |
2018-03-25 | ChromeService: Close listening socket | Lars-Dominik Braun | 1 | -0/+1 | |
We passed it to the child and don’t need it any more. | |||||
2018-03-25 | Move getResponseBody call to Item wrapper | Lars-Dominik Braun | 2 | -13/+21 | |
2018-03-18 | browser: Don’t overwrite LogEntry’s args | Lars-Dominik Braun | 1 | -1/+1 | |
2018-03-18 | behavior: Add click selectors for reddit | Lars-Dominik Braun | 1 | -7/+27 | |
This is slightly obnoxious, since their JavaScript rate-limits clicks to ≤3 Hz and simply ignores everything beyond that. | |||||
2018-03-05 | Add generic click behavior script | Lars-Dominik Braun | 3 | -37/+119 | |
Configurable. Clicks elements matching one (or more) CSS selectors once or multiple times. Currently supported: Facebook, Twitter, Disqus (embedded iframe) | |||||
2018-03-04 | Remove instagram behavior script | Lars-Dominik Braun | 2 | -27/+1 | |
The “load more” button does not exist any more. | |||||
2018-01-20 | behavior: Scroll all DOM elements | Lars-Dominik Braun | 1 | -0/+6 | |
One example is Twitter, which uses a popover div for individual tweets. Scrolling the page won’t scroll that div’s content, which is required to load more replies. | |||||
2018-01-20 | twitter: Expand “more replies” links | Lars-Dominik Braun | 1 | -8/+21 | |
Click them periodically. | |||||
2017-12-27 | Log messages from browser console | Lars-Dominik Braun | 1 | -0/+12 | |
2017-12-25 | Increase default body size | Lars-Dominik Braun | 3 | -5/+34 | |