summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2018-03-25Replace deprecated logger.warnLars-Dominik Braun1-3/+3
2018-03-25ChromeService: Close listening socketLars-Dominik Braun1-0/+1
We passed it to the child and don’t need it any more.
2018-03-25Move getResponseBody call to Item wrapperLars-Dominik Braun2-13/+21
2018-03-18browser: Don’t overwrite LogEntry’s argsLars-Dominik Braun1-1/+1
2018-03-18behavior: Add click selectors for redditLars-Dominik Braun1-7/+27
This is slightly obnoxious, since their JavaScript rate-limits clicks to ≤3 Hz and simply ignores everything beyond that.
2018-03-05Add generic click behavior scriptLars-Dominik Braun3-37/+119
Configureable. Clicks elements matching one (or more) CSS selectors once or multiple times. Currently supported: Facebook, Twitter, Disqus (embedded iframe)
2018-03-04Remove instagram behavior scriptLars-Dominik Braun2-27/+1
The “load more” button does not exist any more.
2018-02-23README: Add Squidwarc to related projectsLars-Dominik Braun1-0/+5
2018-02-22irc plugin: Serialize celery operationsLars-Dominik Braun1-68/+105
This is a workaround for https://github.com/celery/celery/issues/4480
2018-01-20behavior: Scroll all DOM elementsLars-Dominik Braun1-0/+6
One example is Twitter, which uses a popover div for individual tweets. Scrolling the page won’t scroll that div’s content, which is required to load more replies.
2018-01-20twitter: Expand “more replies” linksLars-Dominik Braun1-8/+21
Click them periodically.
2017-12-27Log messages from browser consoleLars-Dominik Braun1-0/+12
2017-12-25Increase default body sizeLars-Dominik Braun4-9/+38
2017-12-24Refactor behavior scriptsLars-Dominik Braun8-194/+302
No functional changes, just cleanup. Replaces onload and onsnapshot events. Move screen metric emulation, DOM snapshots and screenshots here as well.
2017-12-23Set fake finished response for redirectsLars-Dominik Braun1-1/+4
Fixes bcfbdd9b45b7e872ee77e1366197443d855d8c7c
2017-12-23Drain tab event queue before stoppingLars-Dominik Braun1-0/+2
2017-12-22Add simple stats-keeping SiteLoaderLars-Dominik Braun4-10/+60
2017-12-22SiteLoader: Save entire finished responseLars-Dominik Braun1-2/+9
2017-12-22Don’t write WARC record if body cannot be retrievedLars-Dominik Braun1-19/+48
+refactoring.
2017-12-20Increase hardcoded max timeoutsLars-Dominik Braun1-2/+2
We need a better solution for this. Sites loading a lot of responsive images easily need a minute after resizing.
2017-12-20Fix HTTP headers using the same key more than onceLars-Dominik Braun1-2/+15
This is an undocumented DevTools feature.
2017-12-19Serialize WARC writingLars-Dominik Braun2-3/+38
Logger and SiteWriter both access .write_record() concurrently, which can corrupt WARC files. Move the writer to its own thread and decouple it with a queue. Since we’re probably I/O-bound this may speed up writeback as well.
2017-12-19Select default behavior scripts by site URLLars-Dominik Braun5-3/+75
2017-12-18README: Add related projectLars-Dominik Braun1-0/+9
2017-12-17Add Twitter fixupsLars-Dominik Braun1-0/+17
2017-12-17Extend READMELars-Dominik Braun1-10/+25
2017-12-17Don’t fetch redirected request bodyLars-Dominik Braun2-8/+13
We can’t do that safely due to a race-condition.
2017-12-17Add distributed archivingLars-Dominik Braun5-154/+407
Using celery. Also adds a plugin for the IRC bot sopel. Code still needs some love, but it should work.
2017-12-06Start Chrome browser instanceLars-Dominik Braun2-44/+101
Unless --browser argument is given. Uses sane settings and a temporary profile directory.
2017-12-06Add flags to disable screenshot/DOM snapshotLars-Dominik Braun1-5/+9
2017-12-03Add note about Range: requestsLars-Dominik Braun1-0/+2
2017-12-03Fix UTF-8 encoding nameLars-Dominik Braun1-1/+1
HTMLSerializer uses the exact string given in <meta charset=X>, thus it should be with hyphen.
2017-12-03Add page screenshot to WARCLars-Dominik Braun1-0/+14
2017-11-29Add missing timestamp to response data for redirectsLars-Dominik Braun1-1/+1
Fixes 6f628ca24ac2b243dd4a611ff1ecff2d35aaa019
2017-11-29argparse: Add metavarLars-Dominik Braun1-7/+7
2017-11-29Use Chrome’s timestamps as WARC-DateLars-Dominik Braun2-8/+14
2017-11-29RefactoringLars-Dominik Braun5-403/+571
Reusable browser communication and WARC writing.
2017-11-26DOM snapshot: Generate valid HTML5Lars-Dominik Braun2-9/+31
Some tags are “void”, i.e. cannot contain contents and don’t have a closing tag.
2017-11-25Ignore duplicate URLs when saving DOM snapshotLars-Dominik Braun1-1/+10
2017-11-25Workaround broken device metrics resetLars-Dominik Braun1-1/+3
Apparently neither width=0, height=0 nor clearDeviceMetricsOverride() do what they should, so manually reset to 1080p screen size.
2017-11-25Strip on* HTML attributesLars-Dominik Braun2-1/+111
They can carry JavaScript as well and should not be allowed for DOM snapshots.
2017-11-25Rename --run-before-snapshot and document --on* optionsLars-Dominik Braun2-4/+20
2017-11-24DOM snapshot: Save frames/subdocuments as wellLars-Dominik Braun1-13/+36
Request all subdocuments with pierce=True, split the result and save each document. Playback with pywb works, because timestamps of the snapshots are close to each other.
2017-11-24Reset device metricsLars-Dominik Braun1-2/+5
2017-11-24Save onsnapshot script to WARCLars-Dominik Braun1-4/+8
2017-11-22Make <canvas> static before DOM snapshotLars-Dominik Braun3-9/+31
Use --run-before-snapshot=canvas-snapshot.js. Replaces <canvas> with image snapshot. We could use .captureStream() as well.
2017-11-22Emulate different screen sizesLars-Dominik Braun2-3/+25
Causes the browser to load CSS assets and <img> srcset, for example.
2017-11-22Add example fixups for InstagramLars-Dominik Braun3-3/+32
2017-11-21Move base64 metadata into WARC headerLars-Dominik Braun1-1/+1
2017-11-21Graceful page load timeoutLars-Dominik Braun2-11/+32
Stop scrolling script, wait for remaining resources to load.