Reworked provenance verification. #246
Conversation
proto/verification_options.proto
Outdated
```proto
// For every digest, provides a hint about the format. Right now only
// "sha2-256" is supported. Empty string is not permitted.
repeated string formats = 1;

// The actual digests - must have the same length as the formats field.
// A hex-encoding is assumed for all digests. Empty string is not permitted.
repeated string digests = 2;
```
Not sure what the semantics of these two fields are. Say there are 3 items in `formats` and 4 items in `digests`, what would that mean? I think there should be another proto somewhere else to represent the digest, and use that as a repeated type here. Or just use the human-readable format as a single string anyways, since AFAICT you expect users to encode it in a textproto, so having a binary digest is probably not ideal.
len(formats) != len(digests) is invalid/undefined behavior. It is not a map but a multimap (the same key may occur more than once).
I agree. We should use this opportunity to unify the variants of the Digest proto into just one. Defining basic protos in the open source is a good idea because we depend on them both here and internally.
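To make the intended reading concrete, here is a minimal Go sketch (illustrative only, not code from this repository) of treating the two parallel fields as a multimap and rejecting mismatched lengths; all names are made up for the example.

```go
package digestpb

import "fmt"

// ToMultimap pairs formats[i] with digests[i]. A length mismatch is rejected
// as invalid input rather than being left as undefined behavior, and the
// result is a multimap because the same format hint may occur more than once.
func ToMultimap(formats, digests []string) (map[string][]string, error) {
	if len(formats) != len(digests) {
		return nil, fmt.Errorf("got %d formats but %d digests", len(formats), len(digests))
	}
	m := make(map[string][]string, len(formats))
	for i, format := range formats {
		if format == "" || digests[i] == "" {
			return nil, fmt.Errorf("empty format or digest at index %d", i)
		}
		m[format] = append(m[format], digests[i])
	}
	return m, nil
}
```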
Done.
proto/verification_options.proto
Outdated
```proto
// Verifies that the repository coincides with the specified one, for all
// available provenances.
message VerifyAllWithRepository {
  string repositoryUri = 1;
```
```diff
- string repositoryUri = 1;
+ string repository_uri = 1;
```
Done.
cmd/endorser/README.md
Outdated
If the verification options should be kept in a file (for length reasons), then use

```bash
...
--verification_options="$(cat /tmp/ver_opts.textproto)"
```
You could also use the shell itself to do this, no need for `cat`, something like:

```diff
- --verification_options="$(cat /tmp/ver_opts.textproto)"
+ --verification_options="$(</tmp/ver_opts.textproto)"
```
Done.
See [this comment](https://github.com/project-oak/oak/pull/4191#issuecomment-1643932356) as the source of the binary and provenance info.

If the verification options should be kept in a file (for length reasons), then use
TBH I think the common case is to keep the verification options in a file, so we can just point at the file name, or require setting an explicit --skip-verification flag for extra safety.
Who is going to try and embed a textproto in a flag in practice? Even if it is allowed, it seems very error prone :)
What if the binary digest is part of the verification? Then the textproto changes on every call. Also, it is likely called by another script, so we cut out the disk access and the temporary files. Yes, I think it's about calling this directly versus from another script: in the former case you'd prefer passing a file, in the latter case passing it directly.
It is a long command line if there are many checks, but I really liked verifying the counts directly because it is expressive (OTOH you see the number of URIs passed in the other argument):
`--verification_options="provenance_count_at_least { count: 2 }"`
--skip_verification is a good idea.
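For illustration, here is a small Go sketch of the two invocation styles discussed here (inline textproto vs. file) plus an explicit skip flag; the flag names and behavior are assumptions made for the example, not the endorser's actual interface.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

func main() {
	inline := flag.String("verification_options", "", "verification options as an inline textproto")
	fromFile := flag.String("verification_options_file", "", "path to a textproto file with verification options")
	skip := flag.Bool("skip_verification", false, "explicitly opt out of provenance verification")
	flag.Parse()

	var optsText string
	switch {
	case *skip:
		// Caller opted out explicitly, as suggested above.
	case *inline != "":
		optsText = *inline // e.g. passed by a calling script, no temporary file needed
	case *fromFile != "":
		b, err := os.ReadFile(*fromFile)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		optsText = string(b)
	default:
		fmt.Fprintln(os.Stderr, "need --verification_options, --verification_options_file, or --skip_verification")
		os.Exit(1)
	}
	_ = optsText // would then be parsed into the VerificationOptions message
}
```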
Done.
```diff
@@ -1,4 +1,4 @@
-// Copyright 2022 The Project Oak Authors
+// Copyright 2022-2023 The Project Oak Authors
```
I don't think these need to be updated at all
I read up: it is either "single year" or "year range", and somehow I was biased towards the year range. When using a single year, it is the year of publication. When using a year range, the first year is the publication and the second is when it was last revised.
I could find no recommendation as to when to use either one. So should we use the single-year variant throughout?
I have only ever seen a single year. https://goto.google.com/copyright#the-year says that the multi-year range is neither necessary nor encouraged, so I'd just leave everything with the initial year.
I'd like to take a step back for a moment.
Conceptually, is it correct to say that the tool does the following steps?
- download zero or more provenance files
- run some predicates / tests on such files (checking some properties about them, possibly related to some binary somewhere)
- if successful, generate an unsigned provenance file
- this file may then be passed to some other tool / process for the actual signing
The first and second bullet points are correct. For the third, I am confused by the term "unsigned provenance file". Shouldn't it be "unsigned endorsement"? The output file is endorsement.json, containing a header ("_type": "https://in-toto.io/Statement/v0.1", "predicateType": "https://github.com/project-oak/transparent-release/claim/v1"), identification of the underlying binary, endorsement timestamps, and a link to the provenances. The endorsement follows this spec: https://github.com/in-toto/attestation/tree/v0.1.0/spec#statement
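As a hedged illustration of that shape, the statement layout could be modeled roughly like this in Go; the top-level fields follow the linked in-toto Statement v0.1 spec, while the predicate field names are assumptions for the example rather than the authoritative claim schema.

```go
package endorsement

// DigestSet maps an algorithm name to a hex-encoded value, e.g. {"sha256": "<hex>"}.
type DigestSet map[string]string

// Subject identifies the underlying binary.
type Subject struct {
	Name   string    `json:"name"`
	Digest DigestSet `json:"digest"`
}

// Statement mirrors the in-toto Statement v0.1 layout used by endorsement.json.
type Statement struct {
	Type          string         `json:"_type"`         // "https://in-toto.io/Statement/v0.1"
	Subject       []Subject      `json:"subject"`
	PredicateType string         `json:"predicateType"` // the claim/v1 predicate type
	Predicate     ClaimPredicate `json:"predicate"`
}

// ClaimPredicate holds the endorsement timestamps and provenance links;
// the field names here are assumptions for illustration only.
type ClaimPredicate struct {
	IssuedOn    string   `json:"issuedOn,omitempty"`
	ValidUntil  string   `json:"validUntil,omitempty"`
	Provenances []string `json:"provenances,omitempty"`
}
```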
Ah yes, that's my bad, I certainly meant unsigned endorsement (not provenance). Anyways, the point I'm getting at is that there are really 3 steps, and the only reason they currently are in the same binary is that the first step determines whether the rest of the steps should run (which is a good reason, so nothing gets signed unless it's been verified etc.). But this has its own problems, e.g. flags to skip one or more of the steps, or change their semantics. Or if we wanted to run the checks as a stand-alone test in google3. In fact, will we ever use this externally? AFAICT we will only need it internally, and probably as part of a script to run on someone's workstation or a Rapid workflow or something like that. Is that correct?
I assumed that it is only on GitHub, written in Go, and all three steps in one binary, in order to enhance/harden the story around security. If it were not for the security story, we could give up the entire transparent-release repository. More important are the data formats. I wonder if JSON is a good choice. It is flexible but rather fuzzy. How much should we lean on in-toto? Probably in-toto exists somewhere internally, but I haven't checked. The Go-Python interface used internally is just to call the Go binary from Python, which is OK but not great. Summarizing, it would be much more convenient to have it all internal (and maybe not in Go).
A couple of notes: We still need the SLSA provenance generation in open source (not sure how much of that is in SLSA now vs here though). We also need the endorsement format to be a standard (in-toto seems sensible for that), since our other open source libraries will also need to parse it (e.g. as part of attestation verification). Plus in the future we may want to attach claims to endorsements (e.g. saying what we endorse a binary for, instead of just endorsing it globally).
But for the parts between provenance verification and endorsement generation, we have more flexibility. And AFAICT we don't expect anyone other than us to actually run those steps. Although it would be nice to one day have a website in which we explain why we endorsed each binary, with either a formal or informal description of the steps we took before endorsing it. But I don't think that's necessary in the short term, so we can keep developing it internally.
We should keep the idea of having a "policy" expressed via protobuf and getting a yes / no answer about a binary, so that eventually we can publish the policy as well as the policy verification, but we don't need to literally use the same binary internally and externally. We only need the policy to be inspectable externally, and possibly runnable in some way. So yes, let's make some pragmatic decisions about where the code should live based on how we plan to use it first, since that's the main priority.
Thoughts: If our basic protos (verification_options, but digest even more) move internal, no GitHub code can depend on them. In this regard we lock the door behind us. But it should be fine. About the programming language: it is nice to have Go here, but internally we have the readability burden, plus the language interoperability. Therefore, Python seems the better choice for internal code.
```proto
option go_package = "proto/oak/release";

// Contains various digest formats for the same underlying file.
message Digest {
```
The semantics is still not super clear IMO. Does this represent a single digest? Multiple digests that must all match? What if there are conflicting digests in binary and hex format, which one "wins"? IMO this is too flexible / ambiguous and will lead to bugs or vulnerabilities. I would just define a single message for a single digest using a single format, or at the very least have super clear comments describing all the steps to validate this message (but then that will have to be implemented in all the various languages).
I see that the comment actually says that they refer to the same file, but does it mean that one of them must match, or multiple ones? (this is not a new issue introduced by this PR, just something that we have not fully addressed yet)
A single Digest proto refers to a single file, so all digests contained in a single instance of Digest must be compatible in the sense that they come from the same file. Otherwise, it is undefined behavior. The type of digest and encoding is left open; one can specify more than one as needed.
The actual matching is addressed by having a repeated field further up, so if at least one digest in the repeated field matches, then this is a pass.
BTW I am planning to use this as the baseline proto for any Digest. There are a few equivalent versions around. This allows us to have reusable utilities for Digest, e.g. matching one Digest against the other.
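A minimal sketch of what such a reusable matching utility could look like, assuming the described semantics (the digests in one message describe the same file; two Digests agree if they share at least one format and do not conflict on any shared format). The struct below stands in for the generated proto type and is illustrative only.

```go
package digest

// Digest maps a format hint such as "sha2-256" to a hex-encoded value,
// standing in for the generated proto message in this sketch.
type Digest struct {
	Hashes map[string]string
}

// Match reports whether a and b could describe the same file: they must share
// at least one format, and every format they share must carry the same value.
func Match(a, b Digest) bool {
	shared := false
	for format, va := range a.Hashes {
		vb, ok := b.Hashes[format]
		if !ok {
			continue
		}
		if va != vb {
			return false
		}
		shared = true
	}
	return shared
}
```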
Made it more mechanical to better manage the caller's expectations: verification options are a sequence of straightforward checks on an array of provenances. The outcome is either pass or fail (with an error), as before.
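For illustration, the check-sequence shape described here could look like the following Go sketch; the types, check name, and error handling are assumptions for the example, not the PR's actual implementation.

```go
package verify

import "fmt"

// Provenance is a stand-in for the parsed provenance type used by the checks.
type Provenance struct {
	RepositoryURI string
	BinaryDigest  string
}

// Check inspects the whole array of provenances and fails with an error.
type Check func(provenances []Provenance) error

// ProvenanceCountAtLeast mirrors the provenance_count_at_least option quoted above.
func ProvenanceCountAtLeast(n int) Check {
	return func(ps []Provenance) error {
		if len(ps) < n {
			return fmt.Errorf("got %d provenances, want at least %d", len(ps), n)
		}
		return nil
	}
}

// RunChecks applies the checks in order; the outcome is pass (nil) or the first failure.
func RunChecks(ps []Provenance, checks []Check) error {
	for _, c := range checks {
		if err := c(ps); err != nil {
			return err
		}
	}
	return nil
}
```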