Config payload Digest sha1 base32 #532

Open
gitreich opened this issue Apr 9, 2024 · 2 comments
@gitreich
Contributor

gitreich commented Apr 9, 2024

I couldn't find a setting to configure the payload digest as sha1 base32. Our entire archive (even the ARC files!) contains digests in sha1 base32.
Currently it is set to sha256 hex.
This causes problems for deduplication: our CDX uses base32 sha1, so with sha256 hex digests it is not possible to use the CDX for deduplication without regenerating the entire index.

For us the easiest solution would be to make this configurable via a parameter (--digest-encoding string, possibilities: base16, base32, base64, with one of them as the default; for us base32 would be great as the default).
See also https://datatracker.ietf.org/doc/html/rfc3548

Version 0.12.4 used sha1 base32.
Version 1.0.2 now uses sha256 base16.
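
To illustrate the difference between the two formats, here is a minimal Python sketch. It is not the crawler's code: `payload_digest` and its `algorithm`/`encoding` parameters are hypothetical names that only mirror the proposed option, and the `algorithm:value` output shape is an assumption based on how digests are commonly written in WARC headers.

```python
import base64
import hashlib


def payload_digest(payload: bytes, algorithm: str = "sha1", encoding: str = "base32") -> str:
    """Return a digest string of the form '<algorithm>:<encoded hash>'.

    Hypothetical helper for illustration only; the names are not taken
    from the crawler.
    """
    raw = hashlib.new(algorithm, payload).digest()
    if encoding == "base16":
        text = raw.hex()  # lowercase hex
    elif encoding == "base32":
        text = base64.b32encode(raw).decode("ascii")
    elif encoding == "base64":
        text = base64.b64encode(raw).decode("ascii")
    else:
        raise ValueError(f"unsupported encoding: {encoding}")
    return f"{algorithm}:{text}"


# The same (empty) payload in the old and the new format:
old_style = payload_digest(b"", "sha1", "base32")
new_style = payload_digest(b"", "sha256", "base16")
print(old_style)
print(new_style)
```

The two strings cannot be matched against each other, which is why an index built with one format cannot be used to deduplicate captures written with the other.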

@tw4l tw4l self-assigned this Apr 22, 2024
@tw4l tw4l moved this from Triage to Todo in Webrecorder Projects Apr 22, 2024
@tw4l
Member

tw4l commented Apr 22, 2024

Hi @gitreich - putting this on our sprint board to look into after IIPC WAC :)

@gitreich
Contributor Author

gitreich commented May 6, 2024

Hi,
At WAC24, @ikreymer brought up the idea of adding a parameter for the location of a CDX index (for dedup via writing revisit entries).
If this feature arrives, this issue could also be handled through that CDX parameter:
read the payload digest format from the given CDX and keep writing the newly generated WARCs with the digest format found in that index.
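
A rough sketch of how the format could be read from an existing digest value, in Python. This is only a heuristic based on length and alphabet, and `guess_digest_format` is a hypothetical name; it assumes the CDX digest field holds either a bare hash or an `algorithm:hash` string, and it covers only the formats mentioned in this issue plus sha1 hex.

```python
import re
from typing import Optional, Tuple

# (algorithm, encoding, pattern) candidates, checked in order.
# A SHA-1 hash is 20 bytes: 32 base32 characters or 40 hex characters.
# A SHA-256 hash is 32 bytes: 64 hex characters.
_CANDIDATES = [
    ("sha1", "base32", re.compile(r"[A-Z2-7]{32}")),
    ("sha1", "base16", re.compile(r"[0-9a-fA-F]{40}")),
    ("sha256", "base16", re.compile(r"[0-9a-fA-F]{64}")),
]


def guess_digest_format(digest: str) -> Optional[Tuple[str, str]]:
    """Guess (algorithm, encoding) from a digest string, or return None."""
    # Drop an optional 'sha1:' / 'sha256:' style prefix.
    value = digest.split(":", 1)[-1]
    for algorithm, encoding, pattern in _CANDIDATES:
        if pattern.fullmatch(value):
            return algorithm, encoding
    return None


print(guess_digest_format("3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ"))
```

Checking the first few index lines this way would be enough to pick the format for the new WARCs, provided the index uses a single format throughout.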

Labels
None yet
Projects
Status: Todo
Development

No branches or pull requests

2 participants