Failure to Record and Replay X and Instagram #921

Open
szkiwr opened this issue Nov 19, 2024 · 1 comment
szkiwr commented Nov 19, 2024

Describe the bug

When archiving specific websites (such as X and Instagram) in proxy mode, the replay results in issues. On X, the content of posts fails to load, and on Instagram, the text becomes garbled. These problems might be related to brotli or zstandard, but the exact cause is unclear. There were no errors in the logs.
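One way to narrow down which encoding is involved is to look at the first bytes of a stored response body. The sketch below is only an illustration and is not part of pywb; `sniff_compression` is a hypothetical helper name. It relies on the zstd frame magic number and the gzip magic bytes. Brotli has no magic number, so it cannot be identified this way.

```python
# Hypothetical helper, not a pywb API: guess the compression of a raw body
# from its leading bytes.
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # zstd frame magic 0xFD2FB528, little-endian
GZIP_MAGIC = b"\x1f\x8b"          # gzip member header


def sniff_compression(body: bytes) -> str:
    """Return 'zstd', 'gzip', or 'unknown' based on the body's first bytes."""
    if body.startswith(ZSTD_MAGIC):
        return "zstd"
    if body.startswith(GZIP_MAGIC):
        return "gzip"
    # Brotli streams carry no magic number, so they end up here too.
    return "unknown"
```

If the garbled Instagram bodies start with the zstd magic bytes, that would suggest the response was stored or served without being decoded.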

Steps to reproduce the bug

  1. Execute this command:
     wayback --record --live -a --auto-interval 10 -p 9090
  2. Configure Chrome's proxy settings to localhost and port 9090.
  3. Access websites such as Instagram.
  4. Open the archived site in replay mode.

Screenshots

[Image: Screenshot from 2024-11-19 23-32-46]

config.yaml

Here is the configuration used during the recording:

collections_root: collections

# Per-Collection Paths
archive_paths: archive
index_paths: indexes
acl_paths: acl
static_path: static

default_access: allow

templates_dir: templates

# Template HTML
banner_html: banner.html
custom_banner_html: custom_banner.html
head_insert_html: head_insert.html
frame_insert_html: frame_insert.html

base_html: base.html
header_html: header.html
footer_html: footer.html
head_html: head.html

query_html: query.html
search_html: search.html
not_found_html: not_found.html

home_html: index.html
error_html: error.html

proxy_cert_download_html: proxy_cert_download.html
proxy_select_html: proxy_select.html

# Info JSON
info_json: collinfo.json

# HTML Templates List
html_templates:
    - banner_html
    - custom_banner_html
    - head_insert_html
    - frame_insert_html

    - query_html
    - search_html
    - not_found_html

    - home_html

    - base_html
    - header_html
    - head_html
    - footer_html

    - error_html
    - proxy_cert_download_html
    - proxy_select_html

# Other Settings
enable_memento: true
enable_auto_fetch: true
rules_config: pkg://pywb/rules.yaml
#redirect_to_exact: true

# Proxy Settings
proxy:
  coll: my-web-archive
  ca_name: pywb HTTPS Proxy CA
  ca_file_cache: ./proxy-certs/pywb-ca.pem
  recording: true
  enable_banner: true
  enable_content_rewrite: true
  default_timestamp: ''

Environment

  • OS: Ubuntu 20.04.6 LTS 64-bit
  • Browser: Chrome 129.0.6668.100
  • Python Version: 3.12.7 (via pyenv)
  • Pywb Version: 2.8.3

@ikreymer (Member)

Yes, you're right: the issue is that pywb does not handle zstd encoding, which Instagram appears to use by default. In this case, the solution would be to remove zstd from the Accept-Encoding request header.
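As a rough illustration of that workaround, the sketch below removes `zstd` from an Accept-Encoding value and wraps the logic in a generic WSGI middleware. This is not pywb code: `strip_encoding` and `StripZstdMiddleware` are hypothetical names, and whether rewriting the header at the WSGI layer changes what pywb sends upstream while recording is an assumption that would need checking.

```python
# Hypothetical helpers, not part of pywb.
def strip_encoding(header_value, unwanted=("zstd",)):
    """Return an Accept-Encoding value with the unwanted codings removed."""
    kept = []
    for item in header_value.split(","):
        item = item.strip()
        if not item:
            continue
        # The coding name is everything before an optional ";q=..." weight.
        coding = item.split(";", 1)[0].strip().lower()
        if coding not in unwanted:
            kept.append(item)
    return ", ".join(kept)


class StripZstdMiddleware:
    """WSGI middleware that drops zstd from the incoming Accept-Encoding."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        value = environ.get("HTTP_ACCEPT_ENCODING")
        if value is not None:
            # Assumption: the wrapped app reads the header from environ.
            environ["HTTP_ACCEPT_ENCODING"] = strip_encoding(value)
        return self.app(environ, start_response)
```

With this in place, a browser header such as `gzip, deflate, br, zstd` would reach the wrapped application as `gzip, deflate, br`, so a server honoring the header should not reply with a zstd-encoded body.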

Unfortunately, we don't have the bandwidth to update this part of pywb at the moment, so not sure when that may happen.

If you are looking to archive these sites, we recommend using our ArchiveWeb.page extension or desktop app, which should handle these sites pretty well.
