-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Instrument all connection pool PermitPools #10773
Instrument all connection pool PermitPools #10773
Conversation
This pull request is being automatically deployed with Vercel (learn more). vault-storybook – ./ui🔍 Inspect: https://vercel.com/hashicorp/vault-storybook/29r9hv7v4 [Deployment for af48c7d canceled] |
|
It would be great to get some 👀 on this and merged in, this would be tremendously helpful! |
af48c7d
to
2b2866e
Compare
this has many merge conflicts so I'm going to close it, but I'd be happy to rebase this instrumentation PR if y'all are interested in accepting it in the future |
We've run into cases where we believe that our Vault clusters are saturating the permit pool for our backend, leading to connection queueing and client timeouts.
Currently, we cannot directly verify this using metrics and have to infer it based on behavior. Additionally, when testing an increase to the concurrent connection permits for our backend it is harder to directly verify that the increased permit limit is being consumed as expected. Having insight at runtime into each permit pool's limit and current consumption will help us better understand what limits we are hitting within Vault as we operate it in our production environments.
This PR adds a package (
helper/permits
) that includes a metrics instrumented wrapper of the SDK'sPermitPool
type (InstrumentedPermitPool
).Each instance of the instrumented pool adds 2 metric gauges:
$prefix.permits
: number of permits currently being consumed$prefix.permits-limit
: the permit pools maximum permit limitIf the permit pool size were not configurable in many places, I would've opted for a single "free permits" gauge that counts down to 0 as permits are used rather than the 2 gauges included. But being able to see the current max size line on a graph and being able to calculate the percentage of permits currently in use is valuable if permit maximum can change over time.
I've instrumented the lease revocation/expiration pool and all storage backends that currently use the PermitPool in some form.
One storage backend (the oci backend) already included some similar metrics. The new implementation does not clobber or otherwise disrupt those metrics, and instead adds the new metrics to that backend as well, so that current users of the existing metrics are not impacted, but the backend has the standard named metrics for permit usage going forward as well.
I created
helper/permits
as a separate package to avoid requiring this server side metrics implementation to show up in the SDKphysical
package, if there is a better way to do I am happy to change things.Let me know if there's any other information I can provide, or if you'd like to take a different approach to adding visibility to the semaphore/permit pools used within Vault