Age | Commit message | Author | Files | Lines | |
---|---|---|---|---|---|
2018-06-25 | warc: Save DOM-/image screenshot as WARC conversion | Lars-Dominik Braun | 7 | -39/+73 | |
Judging from the docs this is the proper way to store these resources. Enable both for the IRC bot by default, since they won’t interfere with IA’s wayback machine. | |||||
2018-06-21 | Fix travis test command | Lars-Dominik Braun | 1 | -1/+1 | |
2018-06-21 | Fix a few issues pointed out by pylint | Lars-Dominik Braun | 5 | -22/+10 | |
2018-06-21 | browser: Add a few more tests | Lars-Dominik Braun | 1 | -3/+31 | |
Increase coverage. | |||||
2018-06-20 | Move tests to pytest | Lars-Dominik Braun | 6 | -163/+183 | |
It just seems a little nicer than plain old unittest | |||||
2018-06-20 | Add __slots__ to classes | Lars-Dominik Braun | 5 | -1/+56 | |
This is mainly a quality of life change | |||||
2018-06-20 | Synchronous SiteLoader event handling | Lars-Dominik Braun | 7 | -514/+518 | |
Previously a browser crash stalled the entire grab, since events from pychrome were handled asynchronously in a different thread and exceptions were not propagated to the main thread. Now all browser events are stored in a queue and processed by the main thread, allowing us to handle browser crashes gracefully (more or less). This made the following additional changes necessary: (1) clear separation between producer (browser) and consumer (WARC, stats, …); (2) behavior scripts now yield events as well, instead of accessing the WARC writer; (3) WARC logging was removed (for now) and the WARC writer does not require serialization any more. | |||||
2018-06-08 | browser: Replace --remote-debugging-socket-fd | Lars-Dominik Braun | 1 | -23/+19 | |
It was replaced by --remote-debugging-pipe in version 67. pychrome does not support that out of the box, so instead we’ll let Chrome choose its own port and poll a file in its user-data-dir. | |||||
2018-06-03 | behavior: Wrap extract links script in anonymous namespace | Lars-Dominik Braun | 2 | -2/+5 | |
Otherwise it may clash with symbols defined by the page. | |||||
2018-05-20 | behavior: Patreon: Load more comments/replies | Lars-Dominik Braun | 1 | -0/+4 | |
2018-05-20 | behavior: Click Patreon’s “load more” button | Lars-Dominik Braun | 1 | -0/+6 | |
2018-05-05 | Update documentation | Lars-Dominik Braun | 1 | -4/+4 | |
2018-05-05 | Rename command line tools | Lars-Dominik Braun | 3 | -62/+37 | |
Move contrib/ scripts to .tools and add entry points to setup.py, rename crocoite-standalone to crocoite-grab. | |||||
2018-05-05 | Extract only visible and clickable links | Lars-Dominik Braun | 2 | -4/+29 | |
2018-05-05 | contrib: Add WARC merging script | Lars-Dominik Braun | 1 | -0/+70 | |
Very useful for distributed, recursive crawls which create one WARC per page. | |||||
2018-05-04 | sopel: Use recursive, distributed controller | Lars-Dominik Braun | 1 | -2/+7 | |
2018-05-04 | Share recursive argument parser | Lars-Dominik Braun | 2 | -14/+15 | |
2018-05-04 | Support --browser again for local crawls | Lars-Dominik Braun | 2 | -2/+6 | |
Broken by commit 75019eac4545bb2e8b90033834e91beef614cdf3 | |||||
2018-05-04 | Add distributed recursive crawls | Lars-Dominik Braun | 3 | -31/+91 | |
2018-05-04 | Add support for recursive crawls | Lars-Dominik Braun | 2 | -2/+115 | |
Only local right now, not distributed. | |||||
2018-05-04 | browser: Replace context manager decorator | Lars-Dominik Braun | 1 | -51/+66 | |
Use an actual class that supports multiple invocations. | |||||
2018-05-04 | behavior: Add link extraction script | Lars-Dominik Braun | 4 | -5/+43 | |
2018-05-04 | IRC plugin: Use argparse | Lars-Dominik Braun | 1 | -17/+33 | |
2018-05-04 | Move page archiving logic to SinglePageController | Lars-Dominik Braun | 7 | -160/+211 | |
In preparation for recursive crawls. | |||||
2018-05-04 | Move header unfolding into Item | Lars-Dominik Braun | 2 | -21/+24 | |
2018-05-04 | Fetch request POST body | Lars-Dominik Braun | 2 | -8/+20 | |
If there is any and it was not included in the response already. | |||||
2018-05-04 | Test chained redirects | Lars-Dominik Braun | 1 | -12/+32 | |
2018-04-20 | Add screenshot extraction script to contrib/ | Lars-Dominik Braun | 1 | -0/+54 | |
2018-04-20 | Save screenshot of entire page | Lars-Dominik Braun | 1 | -6/+16 | |
…and not just the current viewport. Due to limitations within Chrome it may be necessary to manually stitch multiple images if the page height exceeds 16k pixels. | |||||
2018-04-14 | Fix base64 body detection | Lars-Dominik Braun | 2 | -10/+10 | |
Broken by commit a21d7332e33a3e47a363004196451721d449e70b | |||||
2018-04-14 | Add timeout to request body fetch | Lars-Dominik Braun | 1 | -3/+4 | |
When something goes wrong, these block the entire grab. | |||||
2018-04-14 | Handle JavaScript dialogs | Lars-Dominik Braun | 1 | -2/+37 | |
alert, confirm, prompt and beforeunload | |||||
2018-04-04 | behavior: Add selector for YouTube. | Lars-Dominik Braun | 1 | -0/+6 | |
2018-03-30 | Add click selectors for Instagram | Lars-Dominik Braun | 1 | -0/+8 | |
Load more comments/images for posts. | |||||
2018-03-29 | Travis: Run tests with pypy3 | Lars-Dominik Braun | 1 | -0/+1 | |
2018-03-29 | Use setuptools | Lars-Dominik Braun | 1 | -1/+1 | |
2018-03-25 | Add Travis CI | Lars-Dominik Braun | 2 | -1/+17 | |
2018-03-25 | Add a few simple tests | Lars-Dominik Braun | 1 | -0/+190 | |
To be expanded, but it’s a start… | |||||
2018-03-25 | Replace deprecated logger.warn | Lars-Dominik Braun | 1 | -3/+3 | |
2018-03-25 | ChromeService: Close listening socket | Lars-Dominik Braun | 1 | -0/+1 | |
We passed it to the child and don’t need it any more. | |||||
2018-03-25 | Move getResponseBody call to Item wrapper | Lars-Dominik Braun | 2 | -13/+21 | |
2018-03-18 | browser: Don’t overwrite LogEntry’s args | Lars-Dominik Braun | 1 | -1/+1 | |
2018-03-18 | behavior: Add click selectors for reddit | Lars-Dominik Braun | 1 | -7/+27 | |
This is slightly obnoxious, since their JavaScript rate-limits clicks to ≤3 Hz and simply ignores everything beyond that. | |||||
2018-03-05 | Add generic click behavior script | Lars-Dominik Braun | 3 | -37/+119 | |
Configurable. Clicks elements matching one (or more) CSS selectors once or multiple times. Currently supported: Facebook, Twitter, Disqus (embedded iframe). | |||||
2018-03-04 | Remove instagram behavior script | Lars-Dominik Braun | 2 | -27/+1 | |
The “load more” button does not exist any more. | |||||
2018-02-23 | README: Add Squidwarc to related projects | Lars-Dominik Braun | 1 | -0/+5 | |
2018-02-22 | irc plugin: Serialize celery operations | Lars-Dominik Braun | 1 | -68/+105 | |
This is a workaround for https://github.com/celery/celery/issues/4480 | |||||
2018-01-20 | behavior: Scroll all DOM elements | Lars-Dominik Braun | 1 | -0/+6 | |
One example is Twitter, which uses a popover div for individual tweets. Scrolling the page won’t scroll that div’s content, which is required to load more replies. | |||||
2018-01-20 | twitter: Expand “more replies” links | Lars-Dominik Braun | 1 | -8/+21 | |
Click them periodically. | |||||
2017-12-27 | Log messages from browser console | Lars-Dominik Braun | 1 | -0/+12 | |
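
The 2018-06-20 commit "Synchronous SiteLoader event handling" describes moving browser events into a queue that the main thread drains, so that a browser crash surfaces where it can be handled. The following is only a minimal sketch of that producer/consumer pattern using the standard library, not the project's actual code: `BrowserCrashed`, `producer`, `consume`, `grab` and the event tuples are all hypothetical names chosen for illustration.

```python
import queue
import threading


class BrowserCrashed(Exception):
    """Hypothetical exception raised in the main thread when the browser dies."""


def producer(events, items, crash_after=None):
    """Stand-in for browser callbacks running in a background thread.

    It never touches the consumer directly; it only enqueues events.
    """
    for i, item in enumerate(items):
        if crash_after is not None and i == crash_after:
            # Report the crash as an ordinary event instead of raising here,
            # where the main thread would never see the exception.
            events.put(("crash", None))
            return
        events.put(("item", item))
    events.put(("done", None))


def consume(events, timeout=5.0):
    """Process queued events in the calling (main) thread.

    Yields items to downstream consumers and raises on a crash event.
    A stalled producer shows up as queue.Empty after `timeout` seconds.
    """
    while True:
        kind, payload = events.get(timeout=timeout)
        if kind == "item":
            yield payload
        elif kind == "crash":
            raise BrowserCrashed("browser went away")
        elif kind == "done":
            return


def grab(items, crash_after=None):
    """Run one producer thread and collect what the main thread received."""
    events = queue.Queue()
    worker = threading.Thread(target=producer, args=(events, items, crash_after))
    worker.start()
    received = []
    crashed = False
    try:
        for item in consume(events):
            received.append(item)
    except BrowserCrashed:
        crashed = True
    finally:
        worker.join()
    return received, crashed
```

Because the exception is raised inside `consume`, which runs in the caller's thread, the caller can decide whether to retry, skip the page, or abort, instead of waiting on a thread that has already failed.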