Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support using checksums that aren't SHA-256 #901

Closed
wants to merge 11 commits into from
Closed

Conversation

alexwlchan
Copy link
Contributor

@alexwlchan alexwlchan commented Jul 19, 2021

Closes #900

Starting with BagIt 1.0, bag creation and validation tools MUST
support the SHA-256 and SHA-512 algorithms [RFC6234] and SHOULD
enable SHA-512 by default when creating new bags. For backwards
compatibility, implementers SHOULD support MD5 [RFC1321] and SHA-1
[RFC3174]. Implementers are encouraged to simplify the process of
adding additional manifests using new algorithms to streamline the
process of in-place upgrades.

This is taking us towards the BagIt 1.0 definition of a "complete" bag:

A complete bag MUST meet the following requirements:

  1. Every required element MUST be present (see Section 2.1).
  2. Every file listed in every tag manifest MUST be present.
  3. Every file listed in every payload manifest MUST be present.
  4. For BagIt 1.0, every payload file MUST be listed in every payload
    manifest. Note that older versions of BagIt allowed payload
    files to be listed in just one of the manifests.
  5. Every element present MUST conform to BagIt 1.0.

A valid bag MUST meet the following requirements:

  1. The bag MUST be complete.
  2. Every checksum in every payload manifest and tag manifest has
    been successfully verified against the contents of the
    corresponding file.

Currently we only support SHA-256 manifests: we'll ignore checksums in other manifest files, and if a bag doesn't include SHA-256 checksums we're unable to verify/register it.

This patch implements full support for SHA-256 and SHA-512, and partial support for MD5 an SHA-1. In particular:

  • When a bag is verified, we'll look at the checksums in every manifest. Previously we would ignore them.
  • The bag register will choose the "best" checksum to use in the storage manifest – SHA-512 if available, otherwise SHA-256.
  • We don't support bags that only have MD5 or SHA-1 checksums – these are for backwards-compatibility only. You need to supply at least one of SHA-256 or SHA-512.
  • It's not possible to store bags that don't have SHA-256 checksums, if the caller provides SHA-512 instead.

In a bit more detail, we add the following new checks to follow the BagIt spec:

  • If the checksum in any manifest is wrong, we fail the bag.
  • If any manifest is missing files, we fail the bag.
  • If the same checksums aren't used consistently, we fail the bag. e.g. a bag with manifest-md5.txt and manifest-sha256.txt would fail.

The bag verifier is an incredibly important service – a mistake in this code would be catastrophic. The patch is too big to safely review as-is, so I'm opening this PR to get high-level feedback. Is this design sensible? Have I missed anything? Is there anything you'd like to double-check?

I'll gradually break this up into smaller PRs for proper review.

@alexwlchan alexwlchan changed the title [WIP] Support using checksums that aren't SHA-256 Support using checksums that aren't SHA-256 Jul 19, 2021
@alexwlchan alexwlchan requested review from jtweed and kenoir July 19, 2021 16:13
// This lookup should never fail -- we should already have validated in
// the BagReader class and the bag verifier that if an algorithm is used,
// it is used to describe every file in the bag.
val checksum = multiChecksum.getValue(algorithm).get
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should still throw a sensible error though?

*
* Implementation notes:
*
* - We know the tag manifest is empty because it has to describe
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* - We know the tag manifest is empty because it has to describe
* - We know the tag manifest is not empty because it has to describe

throw MultiChecksumException.NoChecksum
}

// We support MD5 and SHA1 for backwards compatibility (see RFC 8493 § 2.4)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if linking directly to the IETF version of the manifest would be useful (here and elsewhere).

@kenoir
Copy link
Contributor

kenoir commented Jul 20, 2021

I wonder if we should check that the BagIt-Version is >= 1.0 in bagit.txt as part of this as we are implicitly requiring that here.

@kenoir
Copy link
Contributor

kenoir commented Jul 20, 2021

This design seems sensible to me.

Interesting to compare this implementation to https://github.com/LibraryOfCongress/bagit-java

I wonder if we should be pushing our verification code in a direction that makes it less dependent underlying storage provider (in our case we require S3, the LOC version requires a filesystem).

@jtweed
Copy link

jtweed commented Jul 21, 2021

Sorry a bit late to this, but it seems like a really good idea to me. This bit of work is turning into a good opportunity for us to make sure we are compliant with the 1.0 spec. Can't speak to the code, but the logic seems sensible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The storage service should support using checksums that aren't SHA-256
4 participants