Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Storage check -- can it be removed? #201

Closed
ankon opened this issue Aug 24, 2022 · 12 comments
Closed

Storage check -- can it be removed? #201

ankon opened this issue Aug 24, 2022 · 12 comments
Labels
feature request Request for new feature or functionality

Comments

@ankon
Copy link
Contributor

ankon commented Aug 24, 2022

certmagic/config.go

Lines 456 to 461 in 76f61c2

// ensure storage is writeable and readable
// TODO: this is not necessary every time; should only perform check once every so often for each storage, which may require some global state...
err := cfg.checkStorage(ctx)
if err != nil {
return fmt.Errorf("failed storage check: %v - storage is probably misconfigured", err)
}

Depending on the storage this check can be quite expensive (in our case a remote KV store), and it happens in a rather "hot" area: Right when setting up a TLS connection. If we need to get a certificate from a CA this doesn't matter too much in the grand scheme of things, but if our storage has one we triple our roundtrips (1 to get the cert information from storage, and 2 for a write/read from storage to check it).

I think this check might have some value, but not here:

  • If the storage is generally trustworthy, it should be checked before it gets configured for certmagic
  • If the storage is generally untrustworthy and needs regular checks, it should be part of that specific storage

So, I'd propose to just drop these lines here, and possibly move the implementation of checkStorage as an example into documentation. WDYT?

@mholt
Copy link
Member

mholt commented Aug 24, 2022

I'm open to optimizations, but first:

It should already only do the check if a certificate is needed, which should only be "hot" once every 60 days or so -- and even that runs in the background except the very first time. Shouldn't be blocking a TLS handshake after the first time it gets a certificate.

What kinds of performance impacts are you experiencing? How have you verified that the storage checks are causing performance problems?

@ankon
Copy link
Contributor Author

ankon commented Aug 24, 2022

When the server using on-demand TLS with certmagic starts up, its in-memory caches are empty, and that's what I'm trying to optimize right now. Our Storage is using AWS DynamoDB (based on https://github.com/silinternational/certmagic-storage-dynamodb), and as much that scales, it does see quite some load I'd like to get rid off.

But, you're right indeed ... before it would do the storage check, obtainCert first checks with storageHasCertResourcesAnyIssuer for any existing resources. This call should return true (our implementation of Exists internally does a Load and checks for whether that worked), and following the code from there I think it will pick up the certs through an actual Storage.Load in Config.CacheManagedCertificate/Config.loadManagedCertificate.

Leaves me still with some not-so-nice performance if the above is true: I should end up with 3 distinct read requests in storageHasCertResourcesAnyIssuer, and another 3 in loadManagedCertificate. I'll have to measure this more closely to confirm this theory. Ideally I'd like this number to be 1, of course :)

I guess I could get there by
a) Have a tiny internal cache with a short expiration for Load, so that the second load just grabs from that cache rather than roundtrip
b) recognize the storageHasCertResourcesAnyIssuer and pre-load all keys into that cache (in DynamoDB through BatchGetItem or possibly by using a smart query/index operation)

It also IMHO still leaves the question open whether the storage check would not be better in the storage itself, so that an implementation could decide how to check?

@mholt
Copy link
Member

mholt commented Aug 24, 2022

The point of the storage check is to ensure the storage is properly configured, it's not really dependent upon how "trustworthy" the storage mechanism is.

It's tempting to say that just the file system needs such a check to ensure there's enough disk space, but actually most problems I've heard of in production stem from getting its CertMagic config wrong, or the actual storage system itself is misconfigured.

Maybe we could make storage checks optional, but disabling them poses a risk too.

(I suppose ideally we'd optimize them. I have a TODO that they don't need to happen so frequently, but haven't gotten around to implementing that. Even if it's just a simple memory of last time the storage was checked.)

@ankon
Copy link
Contributor Author

ankon commented Aug 25, 2022

What about sth like this:

diff --git a/config.go b/config.go
index ed155d5..266415d 100644
--- a/config.go
+++ b/config.go
@@ -976,11 +976,19 @@ func (cfg *Config) getChallengeInfo(ctx context.Context, identifier string) (Cha
 	return Challenge{Challenge: chalInfo}, true, nil
 }
 
-// checkStorage tests the storage by writing random bytes
-// to a random key, and then loading those bytes and
+// checkStorage tests the storage
+//
+// If the storage implements `StorageChecker` its StorageChecker.Check
+// method will be called, otherwise the storage is checked by writing
+// random bytes to a random key, and then loading those bytes and
 // comparing the loaded value. If this fails, the provided
 // cfg.Storage mechanism should not be used.
 func (cfg *Config) checkStorage(ctx context.Context) error {
+	if sc, ok := cfg.Storage.(StorageChecker); ok {
+		// Storage knows how to check itself, so let it do that.
+		return sc.Check(ctx)
+	}
+
 	key := fmt.Sprintf("rw_test_%d", weakrand.Int())
 	contents := make([]byte, 1024*10) // size sufficient for one or two ACME resources
 	_, err := weakrand.Read(contents)
diff --git a/storage.go b/storage.go
index 0bb34cd..4fc0cfe 100644
--- a/storage.go
+++ b/storage.go
@@ -81,6 +81,12 @@ type Storage interface {
 	Stat(ctx context.Context, key string) (KeyInfo, error)
 }
 
+type StorageChecker interface {
+	// Check returns whether the storage is "good to go" or
+	// whether it cannot be used right now
+	Check(ctx context.Context) error
+}
+
 // Locker facilitates synchronization across machines and networks.
 // It essentially provides a distributed named-mutex service so
 // that multiple consumers can coordinate tasks and share resources.

For me it would allow me to implement the Check method (for instance by only checking once), for existing code nothing would change, and other more advanced storages could indeed implement some "once in while" behavior without needing a change in certmagic itself?

@mholt
Copy link
Member

mholt commented Aug 26, 2022

Maybe, I still feel like storage problems are more related to configuration than actual implementation (usually -- assuming a proper implementation and a capable storage backend).

I feel like a switch on or off in the config is probably a better solution to this. Would that be OK with you?

@ankon
Copy link
Contributor Author

ankon commented Aug 26, 2022

I feel like a switch on or off in the config is probably a better solution to this. Would that be OK with you?

Sure, that would work for us! Shall I make a PR to that effect?

@mholt
Copy link
Member

mholt commented Sep 1, 2022

@ankon Sorry for my delayed reply. Yes, I'd be happy to review that. Probably have the checks enabled by default, with the option to disable them.

@mbardelmeijer
Copy link
Contributor

I would also love to see an option to disable this. Our use-case would be the same as @ankon but for a S3 storage driver :)

@mholt
Copy link
Member

mholt commented Sep 15, 2022

@mbardelmeijer That's good to know. But be aware that S3 is not safe to use as a storage backend by itself, since it cannot support atomic operations.

@mholt mholt added the help wanted Extra attention is needed label Sep 15, 2022
@mholt mholt closed this as completed in 5357574 May 6, 2023
@mholt
Copy link
Member

mholt commented May 6, 2023

@mbardelmeijer @ankon I've implemented this in 5357574. Please feel free to try it out!

@mholt mholt added feature request Request for new feature or functionality and removed help wanted Extra attention is needed labels May 6, 2023
ankon pushed a commit to framer/certmagic that referenced this issue May 11, 2023
Not a good idea most of the time though.
@ankon
Copy link
Contributor Author

ankon commented May 11, 2023

Thanks a lot!

We just rolled this out for a spin, and according to our database statistics this looks great -- no more rw_test_ keys in the "top contributors" diagrams at all. We're going to leave it running for a while now, and see whether overall statistics also show improvements.

@mholt
Copy link
Member

mholt commented May 11, 2023

Awesome, thanks for giving it a spin!

ankon added a commit to framer/caddy that referenced this issue Jun 3, 2024
ankon added a commit to framer/caddy that referenced this issue Jun 3, 2024
mholt pushed a commit to caddyserver/caddy that referenced this issue Jun 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request Request for new feature or functionality
Projects
None yet
Development

No branches or pull requests

3 participants