Timezone support #45

Open
rien333 opened this issue Oct 30, 2024 · 0 comments
rien333 commented Oct 30, 2024

If I read RFC 3339 correctly, offsets from UTC can be specified with a suffix such as +0X:00. However, if I manually create a pages.jsonl with such timezone suffixes, I get an error when I run wacz create --pages pages.jsonl -f mywarc.warc.gz:

Traceback (most recent call last):
  File "/usr/bin/wacz", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/lib/python3.12/site-packages/wacz/main.py", line 123, in main
    value = cmd.func(cmd)
            ^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/wacz/main.py", line 211, in create_wacz
    passed_pages_dict = construct_passed_pages_dict(passed_content)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/wacz/util.py", line 90, in construct_passed_pages_dict
    key = iso_date_to_timestamp(page_dict.pop("ts")) + "/" + url
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/warcio/timeutils.py", line 155, in iso_date_to_timestamp
    return datetime_to_timestamp(iso_date_to_datetime(string))
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/warcio/timeutils.py", line 60, in iso_date_to_datetime
    the_datetime = datetime.datetime(*(int(num) for num in nums))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/warcio/timeutils.py", line 60, in <genexpr>
    the_datetime = datetime.datetime(*(int(num) for num in nums))
                                       ^^^^^^^^
ValueError: invalid literal for int() with base 10: ''

pages.jsonl

{"url":"http://example.archive/example.html","title":"Example","ts":"2013-05-07T00:00:00+01:00","mime":"text/html"}

The case where the timezone suffix is simply Z seems to be handled correctly, though.
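Since the Z suffix is accepted, one possible workaround until offsets are supported is to normalize each ts value to UTC before writing pages.jsonl. The following is only a sketch using the Python standard library; the helper name to_utc_z is hypothetical and is not part of wacz or warcio.

```python
from datetime import datetime, timezone


def to_utc_z(ts: str) -> str:
    """Convert an RFC 3339 timestamp with any UTC offset to the 'Z' form.

    Hypothetical helper for illustration; not part of wacz or warcio.
    Fractional seconds, if present, are dropped in the output.
    """
    # Older Python versions do not accept a trailing "Z" in fromisoformat,
    # so rewrite it as an explicit zero offset first.
    if ts.endswith(("Z", "z")):
        ts = ts[:-1] + "+00:00"
    dt = datetime.fromisoformat(ts)
    # Assumption: a timestamp without any offset is treated as UTC.
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(to_utc_z("2013-05-07T00:00:00+01:00"))  # 2013-05-06T23:00:00Z
```

Applied to the pages.jsonl entry above, this rewrites the ts field to 2013-05-06T23:00:00Z, which denotes the same instant as the original value.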
