Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2.4.0 Release #565

Merged
merged 90 commits into from
Jun 8, 2020
Merged

2.4.0 Release #565

merged 90 commits into from
Jun 8, 2020

Conversation

ikreymer
Copy link
Member

@ikreymer ikreymer commented Jun 8, 2020

No description provided.

ikreymer and others added 30 commits September 3, 2019 17:28
- support outbackcdx (tinycdxserver) and OpenWayback xmlquery interface (ukwa/ukwa-pywb#2)
- convert xml to cdx iter for exact match
- support prefix match (eg. for fuzzy matching) via chaining prefix query, and lazy urlquery in iterator
…heme to load over http from WebHDFS (ukwa/ukwa-pywb#3)

tests: add basic test for WebHFDSLoader api format
…rom WEBHDFS_USER env var or '&delegation=<token>' from WEBHDFS_TOKEN env var (fixes ukwa/ukwa-pywb#5)
- support extending with custom rewriterapp by setting REWRITER_APP_CLASS
- correctly default to 'config.yaml' if no config file specified
… for ukwa/ukwa-pywb#6)

- 'ba_' - for <base> rewriting
- 'je_' - 'javascript-embed' default for client-side rewriting in wombat

better modifiers for css rewriting (server and client):
- 'ce_' - 'css-embed' for any url() embeds in CSS
- 'cs_' - for css stylesheet @import rewriting/other .css
…ywb#7)

- .aclj files contain access controls in reverse sorted, CDXJ-like format
- ./sample_archive/acl contains sample acl files
- directory and single-file acl sources (extend directory aggregator and file index source)
- tests for longest-prefix acl match
- tests for acl applied to collection
- pywb.utils.merge -- merge(..., reverse=True) support for py2.7 (backported from py3.5)
- acl types:
  * allow - all allowed
  * block - allowed in index (as blocked) but content not allowed, served as 451
  * exclude - removed from index and content, served as 404
- warcserver: AccessChecker inited if 'acl_paths' specified in custom collections
- exceptions:
  * clean up wbexception, subclasses provide the status code, message loaded automatically
  * warcserver handles AccessException with json response (now with 451 status)
  * pass status to template to allow custom handling
- 'acl_paths' config can accept a list of files or directories, a file or a directory string
- tests_acl: test collection with acl list, single file, dir
… files via command-line (ukwa/ukwa-pywb#7)

- support as target an auto-collection, where acl file added automatically in ./collections/<coll>/acl/access-rules.aclj
or specifying an .aclj explicitly for more custom configs
- support adding urls and surts, determine if url is already a surt, otherwise canonicalize
acl commands include:
- acl add <target_file_or_coll> <url_or_surt> <access> -- add (or replace) rule for url/surt with access level <access>
- acl remove <target_filr_or_coll> <url_or_surt> -- remove url/surt from target
- acl list <target_file_or_coll> -- list all rules for target
- acl validate <target_file_or_coll> -- ensure sort order is correct, otherwise fix and save
- acl match <target_file_or_coll> <url> -- find matching rule, if any, in target for specified url, or print no match/default rule
- acl importtxt <target_file_or_coll> <filename> -- bulk import of 'excludes.txt' style rules, one url-per-line and add to target
…ywb#7)

- add, importtxt will create an access file if it doesn't exist
- return status code 1 on errors, including if file doesn't exist (for other commands)
- use WbException throughout, only catch HTTPException from werkzeug routing
- only apply refer redirect check for 404 not found errors
- xmlquery index: log unexpected exceptions, treat missing element as not found
…or localization ukwa/ukwa-pywb#11

split init_routes() into init_coll_routes() and make_coll_routes() which retrieves a list of per-collection routes only
…raw' or 'rewritten' mementos (ukwa/ukwa-pywb#12, based on mementoweb/rfc-extensions#6)

- 'enable_prefer: true' in config can be used to enable experimental Memento Prefer behavior
- Prefer header support both redirect and non-redirect style negotiation, extending existing Memento patterns
- Prefer header can be applied both on memento and timegate endpoints
- for redirect style negotiation, Prefer results in a redirect to final memento (if needed), both on Timegate and URL-M (Memento Pattern 2.3)
- for non-redirect style negotiation (Memento Pattern 2.2), Prefer header affects content being served and changes the Content-Location to the canonical representation
- Vary: Prefer and Preference-Applied headers always added to URL-M and Timegate responses
- support Prefer on top-frame url in framed mode, Prefer check runs before custom response
- update Prefer test fixtures to test framed vs frameless and no-mod vs mp_ modifier, all combinations
- fix proxy mode when 'redirect_to_exact=True' is set config, don't redirect in proxy mode
- more general prefer support, moved to content_rewriter to support preference<->mod mappings
- add 'banner-only' preference mapped to bn_ modifier
- proxy mode: allow 'raw' and 'banner-only' preferences
- proxy mode: 'Prefer: rewritten' forced to 'banner-only', served with 'Preference-Applied: banner-only'
- tests: test proxy with prefer header, 'redirect_to_exact=True', add 'banner-only' to Prefer header tests in rewriting mode
…es not start with 2, 4, 5,

to more aggressively check invalid status codes, should fix ukwa/ukwa-pywb#21
- add AppPageNotFound() exception to differntiate app-level not found path from replay content not found
- add custom error messages for collectino not found and static file not found
tests: add tests for collection not found and static file not found errors
- store original wsgi SCRIPT_NAME (before collection path is pushed)
- add 'static_prefix' jinja env global which defaults to original prefix + /static/
- update existing templates to use '{{ static_prefix }}' instead of '{{ host_prefix }}/{{ static_path }''
- set 'pywb.host_prefix' via rewriterapp, set 'static_prefix' to absolute url if available (to support proxy mode)
- ensure lxml-enabled parsing in XmlQueryIndexSource works by passing the raw bytestring instead of unicode text to the parser
- tests: add lxml and non-lxml parsing tests to test_xmlquery_indexsource.py, add lxml to test install
- misc fixes: fix typo in banner.html, update gevent api to support latest gevent
- stop checking acl rules linearly if acl key < tld
- use existing rule for same url (at least until date-range checking)
- support memento timegate on top-frame (when no timestamp is provided)
- treat top-frame no-timestamp url as canonical timegate
- tests: update tests, add memento redirect mode tests for timegate, timegate with accept-dt header
- don't parse json on every aclj line until key prefix matches, resulting in speed boost!
- convert aclj to dict (via cdxobject) only when match is found (disable aggregator source tracking)
- optimize 'wb-manager acl match' command to not load entire file before matching
- acl match <coll_or_file): if 'coll_or_file' exists as file, use it, don't check if auto-collection exist
- fix timemap in 'redirect-to-exact' mode, (ensure timegate redirect condition applies only to top-frame)
- tests: add additional timemap tests, with and without exact redirect
- ensure timemap returns full url-m warcserver supports 'memento_format' param which, if present, specifies
full format to use for memento links in timemap
- memento tests: timemap tests include full url-m, test both framed and frameless timemap responses
yvmarques and others added 29 commits October 31, 2019 17:09
* banner: add banner and localization improvements from ukwa branch:
- show 'view all captures' link if not live
- optional logo
- loc options, if available
- banner options set via window.banner_info in banner.html

localization support: 
- add init_loc() to templateview
- loc available if config options set
- tests: add tests for loading localized messages, override .gitignore to allow test messages.mo
* indexing: restrict POST body appended to query to 16384, avoid reading very large POST requests on indexing
…de (#520)

- if preflight OPTIONS request, respond directly (don't attempt OPTIONS capture lookup)
- if preflight CORS request, ensure response has appropriate CORS headers, even if not captured
- wombat: update to latest wombat with updated Date() fixed timezone in proxy mode
- bump version to 2.4.0rc3
* rewrite fixes:
- dash rewrite fix for fb: when rewriting, match quoted '"dash_prefetched_representation_ids"' as well as w/o quotes,
update tests to ensure rewriting both old and new formats
- wombat update to fix #527: ensure document.write() doesn't accidentally remove end-tag if end-tag was not lowercase (see webrecorder/wombat#21)

* tests: fix recorder cookie filtering test, use https://www.google.com/ for testing

* appveyor: fix appveyor builds
- if a revisit record has empty hash, don't attempt to lookup an original, simply use with empty payload
* banner: fix banner display for non-framed and proxy mode replay, ensure new 'View All Captures' ancillary section is also shown

* bump version to 2.4.0rc4
* misc fixes (rc 5):
- banner: only auto init banner if not in top-frame (check for no-frame mode and replay url is set)
- index: 'cdx+' fix for use as internal index: if cdx has a warc filename and offset, don't attempt default live web load
- improved self-redirect: avoid www2 -> www redirect altogether, not just for second redirect
- tests: update tests for improved self-redirect checking
- bump version to pywb-2.4.0-rc5
* fixes for RC6:
- blockrecordloader: ensure record stream is closed after parsing one record 
- wrap HttpLoader streams in StreamClosingReader() which should close the connection even if stream not fully consumed
- simplify no_except_close
may help with ukwa/ukwa-pywb#53
- iframe: add allow fullscreen, autoplay
- wombat: update to latest, filter out custom wombat props from getOwnPropertyNames
- rules: add rule for vimeo

* cdx formatting: fix output=text to return plain text / non-cdxj output

* auto fetch fix:
- update to latest wombat to fix auto-fetch in rewriting mode
- fix /proxy-fetch/ endpoint for proxy mode recording, switch proxy-fetch to run in recording mode
- don't use global to allow repeated checks

* rewriter html check: peek 1024 bytes to determine if page is html instead of 128

* fix jinja2 dependency for py2
Per the code, the key should use an underscore, not a hyphen. It also seems like the value is parsed as a number instead of a string, which then fails with a type error later, so quote it to force it to be a string.

```
$ pywb
2020-03-10 21:06:33,084: [INFO]: Proxy enabled for collection "web"
Traceback (most recent call last):
  File "/tmp/pywb_venv/bin/pywb", line 8, in <module>
    sys.exit(wayback())
  File "/tmp/pywb_venv/local/lib/python2.7/site-packages/pywb/apps/cli.py", line 20, in wayback
    desc='pywb Wayback Machine Server').run()
  File "/tmp/pywb_venv/local/lib/python2.7/site-packages/pywb/apps/cli.py", line 89, in __init__
    self.application = self.load()
  File "/tmp/pywb_venv/local/lib/python2.7/site-packages/pywb/apps/cli.py", line 181, in load
    return FrontEndApp(custom_config=self.extra_config)
  File "/tmp/pywb_venv/local/lib/python2.7/site-packages/pywb/apps/frontendapp.py", line 79, in __init__
    self.init_proxy(config)
  File "/tmp/pywb_venv/local/lib/python2.7/site-packages/pywb/apps/frontendapp.py", line 569, in init_proxy
    if not self.ALL_DIGITS.match(self.proxy_default_timestamp):
TypeError: expected string or buffer
```
* misc fixes for 2.4.0rc7:
- warcserver: when parsing headers to check for redirect, reserialized headers
may be of different length then original, causing warcserver->app response to hang
now adjusting the content-length on the warc record and also not including a fixed
length when serving warcserver->app, possible fix for ukwa/ukwa-pywb#53
- undo change in path resolvers to use os.path.join, just concatenate full_path + filename
- rewrite 'date' -> 'x-orig-archive-date' header to avoid confusion (eg. #548)
- bump version to rc7

* ci: attempt to fix travis build for 27, 35
…563)

* rewrite:
- don't rewrite xml in proxy mode / html-insert only mode
- ajax: if sec-fetch-mode is set to non-navigate, also treat as 'ajax'

* ci: build python 3.8, ignore 2.7 failures

* reqs: use released ujson for extra_reqs

* hmac: add digestmod, fix for py3.8
…ectly per-memento spec, (#564)

return 404 if not found, return latest memento header. do this by performing actual response lookup,
but then returning the top frame response if succeeded. addresses ukwa/ukwa-pywb#58
@ikreymer ikreymer merged commit c7373ba into master Jun 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.