Age | Commit message | Author | Files | Lines
2018-06-25 | warc: Save DOM-/image screenshot as WARC conversion | Lars-Dominik Braun | 7 | -39/+73
Judging from the docs this is the proper way to store these resources. Enable both for the IRC bot by default, since they won’t interfere with IA’s wayback machine.
2018-06-21 | Fix travis test command | Lars-Dominik Braun | 1 | -1/+1
2018-06-21 | Fix a few issues pointed out by pylint | Lars-Dominik Braun | 5 | -22/+10
2018-06-21 | browser: Add a few more tests | Lars-Dominik Braun | 1 | -3/+31
Increase coverage.
2018-06-20 | Move tests to pytest | Lars-Dominik Braun | 6 | -163/+183
It just seems a little nicer than plain old unittest
2018-06-20 | Add __slots__ to classes | Lars-Dominik Braun | 5 | -1/+56
This is mainly a quality of life change
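For readers unfamiliar with the feature, here is a minimal sketch of what __slots__ changes. The class name Item and its fields are hypothetical and not taken from the crocoite source; the commit itself does not say which effect it was after.

```python
class Item:
    """Hypothetical record class, used only for this illustration."""
    # Only these attribute names may be set on instances.
    __slots__ = ('url', 'status')

    def __init__(self, url, status):
        self.url = url
        self.status = status

item = Item('http://example.com/', 200)
try:
    # A misspelled attribute name is rejected instead of silently created.
    item.stauts = 404
except AttributeError:
    typo_rejected = True
else:
    typo_rejected = False
```

Instances of such a class also carry no per-instance __dict__, which tends to reduce memory use when many objects are created.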
2018-06-20 | Synchronous SiteLoader event handling | Lars-Dominik Braun | 7 | -514/+518
Previously a browser crash stalled the entire grab, since events from pychrome were handled asynchronously in a different thread and exceptions were not propagated to the main thread. Now all browser events are stored in a queue and processed by the main thread, allowing us to handle browser crashes gracefully (more or less). This made the following additional changes necessary:
- Clear separation between producer (browser) and consumer (WARC, stats, …)
- Behavior scripts now yield events as well, instead of accessing the WARC writer
- WARC logging was removed (for now) and WARC writer does not require serialization any more
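The producer/consumer arrangement described above can be sketched roughly as follows. This is not the crocoite implementation: the event tuples, the BrowserCrashed exception and the function names are assumptions made for illustration, and a plain thread stands in for the pychrome callback thread.

```python
import queue
import threading

class BrowserCrashed(Exception):
    """Hypothetical exception type, used only in this sketch."""

def producer(events):
    # Stands in for the browser callback thread: it only enqueues events
    # and never touches consumer state such as a WARC writer.
    for i in range(3):
        events.put(('response', i))
    events.put(('crash', None))

def run():
    events = queue.Queue()
    thread = threading.Thread(target=producer, args=(events,))
    thread.start()
    handled = []
    crashed = False
    try:
        while True:
            # The main thread drains the queue, so a crash surfaces here
            # as an ordinary exception instead of being lost in a worker.
            kind, payload = events.get(timeout=5)
            if kind == 'crash':
                raise BrowserCrashed('browser went away')
            handled.append(payload)
    except BrowserCrashed:
        crashed = True
    thread.join()
    return handled, crashed
```

Because only the main thread consumes events, the consumer side needs no locking of its own, which matches the note that the WARC writer no longer requires serialization.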
2018-06-08 | browser: Replace --remote-debugging-socket-fd | Lars-Dominik Braun | 1 | -23/+19
It was replaced by --remote-debugging-pipe in version 67. pychrome does not support that out of the box, so instead we’ll let Chrome choose its own port and poll a file in its user-data-dir.
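A minimal sketch of the polling approach, assuming Chrome was started with --remote-debugging-port=0 and writes a file named DevToolsActivePort into its user-data-dir whose first line is the chosen port. The commit does not name the file, so the file name, the function name and the timeout values here are assumptions.

```python
import os
import time

def wait_for_port(user_data_dir, timeout=10.0, interval=0.1):
    """Poll the user-data-dir until Chrome has written its port."""
    # Assumed file name; the commit message does not name it.
    path = os.path.join(user_data_dir, 'DevToolsActivePort')
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with open(path) as fd:
                line = fd.readline()
            # Only trust a complete line, in case the file is still
            # being written while we read it.
            if line.endswith('\n'):
                return int(line)
        except FileNotFoundError:
            pass
        time.sleep(interval)
    raise TimeoutError('Chrome did not report a debugging port')
```

The returned port can then be handed to whatever DevTools client is in use, for example as part of an http://localhost:<port> URL.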
2018-06-03 | behavior: Wrap extract links script in anonymous namespace | Lars-Dominik Braun | 2 | -2/+5
Otherwise it may clash with symbols defined by the page.
2018-05-20 | behavior: Patreon: Load more comments/replies | Lars-Dominik Braun | 1 | -0/+4
2018-05-20 | behavior: Click Patreon’s “load more” button | Lars-Dominik Braun | 1 | -0/+6
2018-05-05 | Update documentation | Lars-Dominik Braun | 1 | -4/+4
2018-05-05 | Rename command line tools | Lars-Dominik Braun | 3 | -62/+37
Move contrib/ scripts to .tools and add entry points to setup.py, rename crocoite-standalone to crocoite-grab.
2018-05-05 | Extract only visible and clickable links | Lars-Dominik Braun | 2 | -4/+29
2018-05-05 | contrib: Add WARC merging script | Lars-Dominik Braun | 1 | -0/+70
Very useful for distributed, recursive crawls which create one WARC per page.
2018-05-04 | sopel: Use recursive, distributed controller | Lars-Dominik Braun | 1 | -2/+7
2018-05-04 | Share recursive argument parser | Lars-Dominik Braun | 2 | -14/+15
2018-05-04 | Support --browser again for local crawls | Lars-Dominik Braun | 2 | -2/+6
Broken by commit 75019eac4545bb2e8b90033834e91beef614cdf3
2018-05-04 | Add distributed recursive crawls | Lars-Dominik Braun | 3 | -31/+91
2018-05-04 | Add support for recursive crawls | Lars-Dominik Braun | 2 | -2/+115
Only local right now, not distributed.
2018-05-04 | browser: Replace context manager decorator | Lars-Dominik Braun | 1 | -51/+66
Use an actual class that supports multiple invocations.
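To illustrate the difference this commit refers to, the following sketch contrasts an object produced by contextlib.contextmanager, which can be entered only once, with a plain class implementing __enter__ and __exit__. The names one_shot and Reusable are hypothetical and do not come from the crocoite source.

```python
from contextlib import contextmanager

@contextmanager
def one_shot():
    yield 'resource'

class Reusable:
    """A class-based context manager that may be entered repeatedly."""
    def __init__(self):
        self.entered = 0

    def __enter__(self):
        self.entered += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        return False  # do not swallow exceptions

cm = one_shot()
with cm:
    pass
try:
    # The underlying generator is exhausted, so a second entry fails.
    # The exact exception type depends on the Python version.
    with cm:
        pass
except (RuntimeError, AttributeError):
    generator_reusable = False
else:
    generator_reusable = True

reusable = Reusable()
for _ in range(2):
    with reusable:
        pass
```

Calling one_shot() again would of course create a fresh context manager; the limitation only applies to reusing the same object.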
2018-05-04 | behavior: Add link extraction script | Lars-Dominik Braun | 4 | -5/+43
2018-05-04 | IRC plugin: Use argparse | Lars-Dominik Braun | 1 | -17/+33
2018-05-04 | Move page archiving logic to SinglePageController | Lars-Dominik Braun | 7 | -160/+211
In preparation for recursive crawls.
2018-05-04 | Move header unfolding into Item | Lars-Dominik Braun | 2 | -21/+24
2018-05-04 | Fetch request POST body | Lars-Dominik Braun | 2 | -8/+20
If there is any and it was not included in the response already.
2018-05-04 | Test chained redirects | Lars-Dominik Braun | 1 | -12/+32
2018-04-20 | Add screenshot extraction script to contrib/ | Lars-Dominik Braun | 1 | -0/+54
2018-04-20 | Save screenshot of entire page | Lars-Dominik Braun | 1 | -6/+16
…and not just the current viewport. Due to limitations within Chrome it may be necessary to manually stitch multiple images if the page height exceeds 16k pixels.
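As a rough illustration of the stitching problem, the sketch below splits a tall page into vertical tiles that each stay within a height limit. The limit of 16 * 1024 pixels is an assumed reading of "16k", and the function name tiles is hypothetical; how the tiles are captured and joined is left out.

```python
def tiles(page_height, max_height=16 * 1024):
    """Yield (y_offset, height) pairs that together cover the page."""
    y = 0
    while y < page_height:
        height = min(max_height, page_height - y)
        yield (y, height)
        y += height
```

Each pair describes one screenshot region; pasting the resulting images below one another at their y offsets would reconstruct the full page.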
2018-04-14 | Fix base64 body detection | Lars-Dominik Braun | 2 | -10/+10
Broken by commit a21d7332e33a3e47a363004196451721d449e70b
2018-04-14 | Add timeout to request body fetch | Lars-Dominik Braun | 1 | -3/+4
When something goes wrong, these block the entire grab.
2018-04-14 | Handle JavaScript dialogs | Lars-Dominik Braun | 1 | -2/+37
alert, confirm, prompt and beforeunload
2018-04-04 | behavior: Add selector for YouTube. | Lars-Dominik Braun | 1 | -0/+6
2018-03-30 | Add click selectors for Instagram | Lars-Dominik Braun | 1 | -0/+8
Load more comments/images for posts.
2018-03-29 | Travis: Run tests with pypy3 | Lars-Dominik Braun | 1 | -0/+1
2018-03-29 | Use setuptools | Lars-Dominik Braun | 1 | -1/+1
2018-03-25 | Add Travis CI | Lars-Dominik Braun | 2 | -1/+17
2018-03-25 | Add a few simple tests | Lars-Dominik Braun | 1 | -0/+190
To be expanded, but it’s a start…
2018-03-25 | Replace deprecated logger.warn | Lars-Dominik Braun | 1 | -3/+3
2018-03-25 | ChromeService: Close listening socket | Lars-Dominik Braun | 1 | -0/+1
We passed it to the child and don’t need it any more.
2018-03-25 | Move getResponseBody call to Item wrapper | Lars-Dominik Braun | 2 | -13/+21
2018-03-18 | browser: Don’t overwrite LogEntry’s args | Lars-Dominik Braun | 1 | -1/+1
2018-03-18 | behavior: Add click selectors for reddit | Lars-Dominik Braun | 1 | -7/+27
This is slightly obnoxious, since their JavaScript rate-limits clicks to ≤3 Hz and simply ignores everything beyond that.
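Staying under such a limit amounts to spacing actions at least 1/rate seconds apart. The sketch below shows that idea in Python for illustration only; it is not the crocoite behavior script, and the Throttle class, its parameters and the injectable clock are assumptions.

```python
import time

class Throttle:
    """Space successive actions at least 1/max_hz seconds apart."""
    def __init__(self, max_hz, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / max_hz
        self.clock = clock
        self.sleep = sleep
        self.last = None

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.interval - (now - self.last)
            if remaining > 0:
                # Too early for the next action: sleep off the difference.
                self.sleep(remaining)
                now = self.clock()
        self.last = now
```

A caller would invoke wait() before each click, for example with Throttle(3) to respect a 3 Hz limit.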
2018-03-05 | Add generic click behavior script | Lars-Dominik Braun | 3 | -37/+119
Configurable. Clicks elements matching one (or more) CSS selectors once or multiple times. Currently supported: Facebook, Twitter, Disqus (embedded iframe)
2018-03-04 | Remove instagram behavior script | Lars-Dominik Braun | 2 | -27/+1
The “load more” button does not exist any more.
2018-02-23 | README: Add Squidwarc to related projects | Lars-Dominik Braun | 1 | -0/+5
2018-02-22 | irc plugin: Serialize celery operations | Lars-Dominik Braun | 1 | -68/+105
This is a workaround for https://github.com/celery/celery/issues/4480
2018-01-20 | behavior: Scroll all DOM elements | Lars-Dominik Braun | 1 | -0/+6
One example is Twitter, which uses a popover div for individual tweets. Scrolling the page won’t scroll that div’s content, which is required to load more replies.
2018-01-20 | twitter: Expand “more replies” links | Lars-Dominik Braun | 1 | -8/+21
Click them periodically.
2017-12-27 | Log messages from browser console | Lars-Dominik Braun | 1 | -0/+12