AWS recently released support for new checksum algorithms: SHA-1, SHA-256, CRC-32, and CRC-32C. You can specify a checksum on object upload, and AWS will tell you if the checksum doesn't match. It also stores the checksum as part of the object attributes, which you can retrieve using GetObjectAttributes. This is obviously intriguing as a possible way to do bag verification.
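For reference, a minimal sketch of the feature with boto3 (the bucket and key names here are made up, not anything in the storage service):

```python
import base64
import hashlib

import boto3

s3 = boto3.client("s3")

body = b"hello world"
checksum = base64.b64encode(hashlib.sha256(body).digest()).decode("ascii")

# S3 recomputes the SHA-256 server-side and rejects the upload if it
# doesn't match the checksum we supplied.
s3.put_object(
    Bucket="example-bucket",
    Key="example-key",
    Body=body,
    ChecksumSHA256=checksum,
)

# The checksum is stored with the object, and we can retrieve it later.
resp = s3.get_object_attributes(
    Bucket="example-bucket",
    Key="example-key",
    ObjectAttributes=["Checksum"],
)
print(resp["Checksum"]["ChecksumSHA256"])
```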
These are my thoughts on the feature, and how we might use it in the storage service.
tl;dr: I think it's an interesting feature we'd be able to use as an additional check in some cases, but it can't replace the bag verifier.
It only supports a limited set of checksum algorithms. We already support SHA-512 in the storage service, and we have at least a few bags that use it. Until AWS add support for that algorithm, we can't verify those checksums.
It only supports a single checksum per object. A bag can contain multiple payload manifests, and we have at least a few bags that do. The bag verifier checks every checksum in every manifest; S3 can only verify one of them.
Although I don't think we're bringing in any new bags with multiple checksums, I can imagine it happening for some born-digital content. If a donor supplies, say, MD5 checksums, it'd be nice to create an MD5 manifest as well as the SHA-256 manifest, so we get end-to-end verification as far back as the donor.
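For illustration, such a bag would have two payload manifests listing the same files (hypothetical filename, checksums truncated):

```
manifest-sha256.txt:
    6a180f…  data/b12345678.xml

manifest-md5.txt:
    9e107d…  data/b12345678.xml
```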
It uses a "checksum of checksums" for large objects. If you use multipart uploads (which you have to use for objects larger than 5 GB), what you get isn't a checksum of the object as a whole, but of the parts:
> The AWS SDKs now take advantage of client-side parallelism and compute checksums for each part of a multipart upload. The checksums for all of the parts are themselves checksummed and this checksum-of-checksums is transmitted to S3 when the upload is finalized.
The bag verifier gets a checksum for the object as a whole.
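To make the difference concrete, here's a sketch of the two calculations, assuming the scheme AWS describes (hash each part, concatenate the binary digests, hash the concatenation, and append a part-count suffix):

```python
import base64
import hashlib

def composite_sha256(parts: list[bytes]) -> str:
    # What S3 stores for a multipart upload: a checksum of the
    # concatenated part checksums, with the part count appended.
    digests = b"".join(hashlib.sha256(p).digest() for p in parts)
    combined = base64.b64encode(hashlib.sha256(digests).digest()).decode("ascii")
    return f"{combined}-{len(parts)}"

def whole_object_sha256(parts: list[bytes]) -> str:
    # What a payload manifest records: one checksum over the whole object.
    digest = hashlib.sha256(b"".join(parts)).digest()
    return base64.b64encode(digest).decode("ascii")

parts = [b"a" * 5, b"b" * 5]
print(composite_sha256(parts))     # differs from the manifest value
print(whole_object_sha256(parts))  # what the bag verifier checks
```

The two values can't be compared directly, so for multipart objects the S3 checksum doesn't help us verify a manifest entry.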
We have millions of objects that pre-date this feature. Unless we reupload every object already in the storage service (an expensive and unnecessary risk), those objects won't have these checksum attributes. Although they're already written, we need to be able to re-verify them, e.g. if a future version of a bag refers to them in its fetch.txt.
I can imagine us using this as an additional check on objects that (1) use the SHA-256 checksum algorithm and (2) are small enough not to require a multipart upload, but the bag verifier provides more robust checks than this feature.
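If we did wire this in, the gate could be as simple as the following sketch. The helper name and the 5 GB threshold are assumptions: 5 GB is S3's single-PUT limit, but in practice our uploads may switch to multipart at a smaller size.

```python
SINGLE_PUT_LIMIT = 5 * 1024 ** 3  # S3's limit for a single PutObject call

def can_use_s3_checksum(algorithm: str, size_in_bytes: int) -> bool:
    # Hypothetical gate: only lean on the S3 checksum when the manifest
    # algorithm is one S3 supports and the object fit in a single PUT.
    return algorithm == "SHA-256" and size_in_bytes <= SINGLE_PUT_LIMIT
```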