tuf: improve TUF client concurrency and caching #1953

Merged
merged 5 commits from fix-tuf-client into main on Jun 7, 2022

Conversation

asraa
Contributor

@asraa asraa commented Jun 2, 2022

Summary

  • Moves Rekor public key retrieval from the Rekor API into GetRekorPubs.
  • TUF objects always use an in-memory local store and target store. If caching is enabled, we sync with the on-disk store once on start and re-sync on any updates. We do not keep the local disk cache open, so we don't hog the OS file lock.
  • TUF objects are now singletons (see the POC from @vaikas in DNM concurrent NewFromEnv tests #1941): there is no re-initialization for each process, and we can re-use the TUF object (see the sketch below).
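A minimal, self-contained sketch of the pattern described above (as merged, initialization is guarded by a sync.Once; the real TUF type wraps a go-tuf client, and seedFromDiskCache is a placeholder here, not a cosign API):

```go
package tuf

import "sync"

// TUF is trimmed down for the example: it only holds the in-memory metadata map.
type TUF struct {
	meta map[string][]byte // always-in-memory local store
}

var (
	singletonTUF     *TUF
	singletonTUFOnce = new(sync.Once) // a pointer, so tests can swap in a fresh Once
)

// initializeTUF builds the TUF object at most once per process; later calls
// reuse the cached instance instead of re-opening the on-disk cache.
func initializeTUF(cacheEnabled bool) *TUF {
	singletonTUFOnce.Do(func() {
		t := &TUF{meta: map[string][]byte{}}
		if cacheEnabled {
			// Seed the in-memory store from the on-disk cache once; the disk
			// store is closed again immediately so the OS file lock is not held.
			seedFromDiskCache(t.meta)
		}
		singletonTUF = t
	})
	return singletonTUF
}

// seedFromDiskCache stands in for reading cached TUF metadata from disk.
func seedFromDiskCache(meta map[string][]byte) {}
```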

Ticket Link

Partial fix #1935

Release Note


@asraa asraa force-pushed the fix-tuf-client branch from 4497372 to 227c133 on June 2, 2022 18:06
@codecov-commenter

codecov-commenter commented Jun 2, 2022

Codecov Report

Merging #1953 (8719e02) into main (ae90c74) will increase coverage by 0.69%.
The diff coverage is 53.19%.

@@            Coverage Diff             @@
##             main    #1953      +/-   ##
==========================================
+ Coverage   34.00%   34.69%   +0.69%     
==========================================
  Files         153      153              
  Lines        9981    10076      +95     
==========================================
+ Hits         3394     3496     +102     
  Misses       6208     6208              
+ Partials      379      372       -7     
Impacted Files Coverage Δ
cmd/cosign/cli/fulcio/fulcioroots/fulcioroots.go 37.83% <ø> (+1.47%) ⬆️
cmd/cosign/cli/fulcio/fulcioverifier/ctl/verify.go 50.00% <ø> (+0.73%) ⬆️
cmd/cosign/cli/verify/verify_blob.go 10.25% <0.00%> (-0.04%) ⬇️
pkg/cosign/tuf/testutils.go 0.00% <0.00%> (ø)
pkg/cosign/tlog.go 30.07% <11.11%> (+0.94%) ⬆️
pkg/cosign/tuf/client.go 64.72% <66.03%> (+1.91%) ⬆️
pkg/cosign/verify.go 28.85% <75.00%> (-3.93%) ⬇️
...is/policy/v1beta1/clusterimagepolicy_validation.go 93.06% <0.00%> (-0.78%) ⬇️
...s/policy/v1alpha1/clusterimagepolicy_validation.go 93.06% <0.00%> (-0.69%) ⬇️
... and 7 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ae90c74...8719e02.

@asraa asraa force-pushed the fix-tuf-client branch from 94cfb96 to 078553a on June 2, 2022 19:06
@asraa asraa marked this pull request as ready for review June 2, 2022 20:16
@dlorenc
Member

dlorenc commented Jun 2, 2022

Should we drop WIP from the title?

@asraa
Contributor Author

asraa commented Jun 2, 2022

Should we drop WIP from the title?

I just realized one more clean-up I can do to make this PR cleaner, and I'll drop it then! I'll ping back for a review.

@dlorenc
Member

dlorenc commented Jun 2, 2022

Cool! Nice job here :)

Co-authored-by: Ville Aikas <[email protected]>
Signed-off-by: Asra Ali <[email protected]>
@asraa asraa force-pushed the fix-tuf-client branch from 078553a to 929e797 on June 2, 2022 20:43
@asraa asraa mentioned this pull request Jun 2, 2022
@asraa asraa changed the title from "WIP: TUF client improvements: use in-memory store" to "tuf: improve TUF client concurrency and caching" Jun 2, 2022
@asraa
Contributor Author

asraa commented Jun 2, 2022

cc @dlorenc @vaikas @haydentherapper @znewman01

Ready for review! Addressed some of the clean-up issues; expect more to come, including (listed in #1935):

  • Move usage of the TUF client to populate CheckOpts: we do this for the Fulcio root, but not for the Rekor pubs. I think cosign as a library will be much clearer if the TUF client is separated from the verification library functions (see the sketch after this list).
  • Clean-up on the Rekor pub keys alternative env vars.
  • Minor clean-up on the GetRemoteRoot and GetEmbedded exported functions.
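For illustration only, a hedged sketch of what the first item could look like once the TUF client is separated from verification: the keys are fetched up front and handed to CheckOpts, so the verification functions never touch TUF. The RekorPubKeys field is a hypothetical CheckOpts field assumed for this sketch, not the current cosign API:

```go
package main

import (
	"context"

	"github.com/sigstore/cosign/pkg/cosign"
)

// newCheckOpts sketches the proposed separation: the TUF client is consulted
// once, up front, and verification only ever sees the resulting keys.
// NOTE: RekorPubKeys is a hypothetical field used for illustration only.
func newCheckOpts(ctx context.Context) (*cosign.CheckOpts, error) {
	rekorPubs, err := cosign.GetRekorPubs(ctx) // the only place TUF/network access happens
	if err != nil {
		return nil, err
	}
	return &cosign.CheckOpts{RekorPubKeys: rekorPubs}, nil
}
```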

func GetRekorPubs(ctx context.Context) (map[string]RekorPubKey, error) {
// SIGSTORE_TRUST_REKOR_API_PUBLIC_KEY - If specified, fetches the Rekor public
// key from the Rekor server using the provided rekorClient.
// TODO: Rename SIGSTORE_TRUST_REKOR_API_PUBLIC_KEY to be test-only or remove.
Contributor

nit: I'd say remove entirely - Ultimately this will get used outside of testing, and at least with the other env var, the file is provided out of band rather than fetching the key directly from the source.

Contributor

One thought here is that being able to fetch the public key from Rekor is part of the Rekor API, and whether this flag is here or not is only one part of it getting used 'outside of testing'.

Contributor

I think it’s fine to provide an API to fetch the public key, but there needs to be an explicit strategy around root of trust management. Ideally it’d always be TUF for Sigstore, but consumers may choose to TOFU - but they must persist the key, verification logic can’t rely on fetching the key each time.

Contributor Author

Agree it should be removed -- but I'd like to do this in another PR, because the current webhook and tests rely on it and I need to dig into that separately.

// singletonTUF holds a single instance of TUF that will get reused on
// subsequent invocations of initializeTUF
singletonTUF *TUF
singletonTUFMu sync.Mutex
Contributor

Could you use a Once to control access instead of a mutex?

Contributor

We probably could. I just wasn't sure if we'd be swapping the TUF object on update, or just modifying parts of it on update. I think we might just be able to modify the in-mem map during updates, but as I said I wasn't sure what the right level of modification was, so I started with this.

Contributor Author

We are modifying parts of it on update -- Once is feasible, I can work with resetting it during tests.

Contributor Author

Actually, I dug in while I was attempting this change: we don't end up updating the TUF client unless we re-initialize. Since this PR uses the singleton/mutex we don't actually support long-running processes. I'm going to change to Once since its behavior will be the same as a mutex, and call out a follow-up to support long-running process updates in the Cleanup issue (in practice this isn't that big of an issue right now).

Contributor

Thanks, I think Once is cleaner since it explicitly shows that the singleton should be initialized only once.

@@ -105,6 +108,12 @@ type remoteCache struct {
Mirror string `json:"mirror"`
}

func resetForTests() {
singletonTUFMu.Lock()
Contributor

Does this need to be wrapped in a mutex? If tests aren't run in parallel the lock isn't needed. If tests are being run in parallel, then wouldn't deleting the singleton be an issue?

Contributor

Since some of the tests are expecting a 'fresh' TUF setup, I added this to make sure we don't get any errors from modifying it without a lock in place.

Contributor Author

Tests within the tuf pkg will be sequential, and we need the reset (but as you said, we don't need the lock).

There is one test outside the TUF package that uses a custom TUF root; I'm concerned that it may be run in parallel. I will be refactoring it anyway to use a custom CheckOpts for the Rekor key rather than a custom TUF setup.

In the interim, I need to think about how to lock that test.

Contributor Author

OK, resolution: I added a TODO for that test that uses a custom root outside this package; expect a PR this afternoon addressing that.

return t.getRootStatus()
}

// Close closes the local TUF store. Should only be called once per client.
func (t *TUF) Close() error {
Contributor

To clarify, this is no longer needed because we only access the TUF DB to write the in-memory metadata on initialization if not present, or read it if present?

Contributor

Yup, that's right! For longer running processes (like the webhook) there should be a way to refresh the data, but @asraa has some ideas there.

Contributor Author

For longer running processes we are also refreshing now -- this is handled in updateClient: we Update our local in-memory metadata, then syncMetadata to the DB.

Contributor

I'm a little confused why we need the mutex. Is it because there's a concern that initializeTUF will be called multiple times by other processes?

Contributor Author

It shouldn't -- I switched over to Once initialization, and reset the sync.Once for tests.

The only time it will be called multiple times potentially in parallel is because of the one test outside the TUF pkg that may run in parallel with the TUF tests.

}
// Sync the in-memory local store to the on-disk cache.
tufDB := filepath.FromSlash(filepath.Join(rootCacheDir(), "tuf.db"))
diskLocal, err := tuf_leveldbstore.FileLocalStore(tufDB)
Contributor

Is this API expected to be used, or are we reaching into private APIs? I just wanna make sure this is maintainable.

Contributor Author

Yeah, FileLocalStore is meant to be used!
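For context, a hedged sketch of how the one-shot sync to the on-disk cache can look with FileLocalStore, assuming go-tuf's LocalStore exposes SetMeta and Close as in the releases current at the time (syncMetadataToDisk is an illustrative name, not the exact PR code):

```go
package tuf

import (
	"encoding/json"

	tuf_leveldbstore "github.com/theupdateframework/go-tuf/client/leveldbstore"
)

// syncMetadataToDisk opens the leveldb-backed store, copies the in-memory
// metadata into it, and closes it again so the OS file lock is released
// between invocations. Error handling is simplified for the example.
func syncMetadataToDisk(inMem map[string]json.RawMessage, tufDB string) error {
	diskLocal, err := tuf_leveldbstore.FileLocalStore(tufDB)
	if err != nil {
		return err
	}
	defer diskLocal.Close() // release the file lock as soon as the sync is done
	for name, meta := range inMem {
		if err := diskLocal.SetMeta(name, meta); err != nil {
			return err
		}
	}
	return nil
}
```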

Contributor

@vaikas vaikas left a comment

Thanks @asraa for getting this wrangled into shape and @haydentherapper for the reviews!

singletonTUF *TUF
singletonTUFMu sync.Mutex
singletonTUF *TUF
singletonTUFOnce = new(sync.Once)
Contributor

Wow, I actually didn't know that Go had allocation via new.

I don't mind this approach, but I found that someone else had the same problem and made a version of sync.Once with a Reset - https://github.com/matryer/resync

Contributor Author

Yeah, I saw this as well, and ideally I wanted to avoid a totally new dependency just for testing -- I'm not sure if there's a big downside to this approach though.
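A minimal sketch of the reset-by-reallocation approach, which avoids taking on resync as a dependency just for tests (it relies on the package-level variables shown in the hunk above):

```go
// resetForTests discards the cached singleton and swaps in a fresh sync.Once
// so the next initializeTUF call rebuilds the TUF object from scratch.
// This is only safe because the tests in this package run sequentially.
func resetForTests() {
	singletonTUFOnce = new(sync.Once)
	singletonTUF = nil
}
```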

if ok && !isExpiredTimestamp(trustedTimestamp) && !forceUpdate {
// We're golden so stash the TUF object for later use
singletonTUF = t
return
Contributor

Is this returning nil?

Contributor Author

It's returning out of the sync.Once func, but we still return singletonTUF from initializeTUF

var err error
t.local, err = newLocalStore()
if err != nil {
panic(err)
Contributor

Darn, is this a consequence of sync.Once? If so, this isn't viable imo; we shouldn't add a panic into the client, because if a server starts depending on it, it opens up an opportunity for queries of death. Is there a way to return an error? If not, what you had before would be fine.

Contributor

+1 on not panicking.

Contributor

Could do a global variable - https://stackoverflow.com/questions/42069615/too-many-arguments-to-return-error

Just ran into this same issue in fulcioroots

Contributor Author

Using the global!
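For reference, a hedged sketch of the package-level error pattern from the Stack Overflow link, replacing the panic inside the Once func and building on the hunk quoted above (singletonTUFErr is an illustrative name; the PR's actual variable may differ):

```go
var singletonTUFErr error // set inside the Once func instead of panicking

func initializeTUF() (*TUF, error) {
	singletonTUFOnce.Do(func() {
		t := &TUF{}
		t.local, singletonTUFErr = newLocalStore()
		if singletonTUFErr != nil {
			return // every caller of initializeTUF now sees this error
		}
		singletonTUF = t
	})
	return singletonTUF, singletonTUFErr
}
```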

Signed-off-by: Asra Ali <[email protected]>
Contributor

@haydentherapper haydentherapper left a comment

Nice work!

@znewman01
Contributor

LGTM, great work on this!

Successfully merging this pull request may close these issues.

Cleanup for TUF code