CDX API handling of excluded from the Wayback Machine URLs #157

Closed
h6197627 opened this issue Feb 17, 2022 · 3 comments · Fixed by #158
Labels
bug Something isn't working


h6197627 commented Feb 17, 2022

URLs that have been excluded from the Wayback Machine are not handled properly when using the CDX Server API (the Availability API is fine).
A manual request through the web user interface returns:

Sorry.

This URL has been excluded from the Wayback Machine.

API request returns:
org.archive.util.io.RuntimeIOException: org.archive.wayback.exception.AdministrativeAccessControlException: Blocked Site Error

waybackpy's cdx_utils.py does not expect such a response and crashes with an exception:

Traceback (most recent call last):
  ...
    for snapshot in cdx.snapshots():
  File "/usr/local/lib/python3.6/dist-packages/waybackpy/cdx_api.py", line 144, in snapshots
    for text in texts:
  File "/usr/local/lib/python3.6/dist-packages/waybackpy/cdx_api.py", line 52, in cdx_api_manager
    total_pages = get_total_pages(self.url, self.user_agent)
  File "/usr/local/lib/python3.6/dist-packages/waybackpy/cdx_utils.py", line 15, in get_total_pages
    return int(response.text.strip())
ValueError: invalid literal for int() with base 10: 'org.archive.util.io.RuntimeIOException: org.archive.wayback.exception.AdministrativeAccessControlException: Blocked Site Error'
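
The failure can be reproduced offline, without contacting the CDX server. The following is a minimal sketch: `naive_total_pages` is a hypothetical stand-in that mirrors only the failing line from the traceback (`int(response.text.strip())`), and `BLOCKED_BODY` is the response text copied from this report. It is not the actual waybackpy code.

```python
# Response body the CDX server returned for the excluded URL (copied from this report).
BLOCKED_BODY = (
    "org.archive.util.io.RuntimeIOException: "
    "org.archive.wayback.exception.AdministrativeAccessControlException: "
    "Blocked Site Error"
)


def naive_total_pages(body: str) -> int:
    """Hypothetical stand-in for the failing line: int(response.text.strip())."""
    return int(body.strip())


# A numeric body parses fine; the error body raises ValueError.
try:
    naive_total_pages(BLOCKED_BODY)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The parse assumes the body is always a page count, so any non-numeric server message surfaces as a bare `ValueError` rather than something the caller can act on.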

To Reproduce
Sample URL: http://gotceleb.com

Version:

  • OS: Ubuntu 18.04
  • Version: 3.0.2
@h6197627 h6197627 added the bug Something isn't working label Feb 17, 2022
Owner

akamhy commented Feb 17, 2022

@h6197627 This should probably raise a custom exception instead of ValueError. Maybe BlockedSiteError?
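
One way the suggestion above could look is sketched here. The names `BlockedSiteError`, `BLOCKED_MARKER`, and `parse_total_pages` are illustrative assumptions, as is the choice to detect the condition by searching for the exception class name in the response body; the change that was eventually merged may differ.

```python
class BlockedSiteError(Exception):
    """Hypothetical exception for URLs excluded from the Wayback Machine."""


# Assumption: the substring below is a reliable signal of an excluded URL.
BLOCKED_MARKER = "AdministrativeAccessControlException"


def parse_total_pages(body: str) -> int:
    """Parse the page-count response, raising BlockedSiteError for excluded URLs."""
    text = body.strip()
    if BLOCKED_MARKER in text:
        raise BlockedSiteError(
            "The URL has been excluded from the Wayback Machine: " + text
        )
    # Any other non-numeric body still raises ValueError, as before.
    return int(text)
```

A caller could then write `except BlockedSiteError:` to handle excluded URLs separately from other parsing failures.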

Author

h6197627 commented Feb 17, 2022

In my opinion, it would be better to simply return no snapshots without raising an exception, since from the user's perspective I don't think it matters whether the URL is blocked or simply was never archived.
Though in some use cases that I am not aware of, this information might be important.
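
For comparison, the "no snapshots, no exception" behaviour described here might be sketched as follows. `snapshots_or_empty` and `BLOCKED_MARKER` are hypothetical names, and the input is assumed to be an iterable of raw CDX response bodies with one snapshot record per line; this is not waybackpy's interface.

```python
from typing import Iterable, Iterator

# Assumption: this substring identifies the "excluded URL" response body.
BLOCKED_MARKER = "Blocked Site Error"


def snapshots_or_empty(pages: Iterable[str]) -> Iterator[str]:
    """Yield snapshot lines, silently yielding nothing for an excluded URL."""
    for page in pages:
        if BLOCKED_MARKER in page:
            # Treat an excluded URL exactly like a URL with no captures.
            return
        for line in page.splitlines():
            if line.strip():
                yield line
```

The trade-off is the one raised in this comment: the caller can no longer tell an excluded URL apart from one that was never archived.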

Author

But you are probably right: if the API treats this situation as an exception, then it is better to stay close to the API. I think a custom BlockedSiteError is OK.
