Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Instrument all connection pool PermitPools #10773

Closed

Conversation

ianferguson
Copy link
Contributor

@ianferguson ianferguson commented Jan 26, 2021

We've run into cases where we believe that our Vault clusters are saturating the permit pool for our backend, leading to connection queueing and client timeouts.

Currently, we cannot directly verify this using metrics and have to infer it based on behavior. Additionally, when testing an increase to the concurrent connection permits for our backend it is harder to directly verify that the increased permit limit is being consumed as expected. Having insight at runtime into each permit pool's limit and current consumption will help us better understand what limits we are hitting within Vault as we operate it in our production environments.

This PR adds a package (helper/permits) that includes a metrics instrumented wrapper of the SDK's PermitPool type (InstrumentedPermitPool).

Each instance of the instrumented pool adds 2 metric gauges:

  • $prefix.permits: number of permits currently being consumed
  • $prefix.permits-limit: the permit pools maximum permit limit

If the permit pool size were not configurable in many places, I would've opted for a single "free permits" gauge that counts down to 0 as permits are used rather than the 2 gauges included. But being able to see the current max size line on a graph and being able to calculate the percentage of permits currently in use is valuable if permit maximum can change over time.

I've instrumented the lease revocation/expiration pool and all storage backends that currently use the PermitPool in some form.

One storage backend (the oci backend) already included some similar metrics. The new implementation does not clobber or otherwise disrupt those metrics, and instead adds the new metrics to that backend as well, so that current users of the existing metrics are not impacted, but the backend has the standard named metrics for permit usage going forward as well.

I created helper/permits as a separate package to avoid requiring this server side metrics implementation to show up in the SDK physical package, if there is a better way to do I am happy to change things.

Let me know if there's any other information I can provide, or if you'd like to take a different approach to adding visibility to the semaphore/permit pools used within Vault

@vercel
Copy link

vercel bot commented Jan 26, 2021

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployments, click below or on the icon next to each commit.

vault-storybook – ./ui

🔍 Inspect: https://vercel.com/hashicorp/vault-storybook/29r9hv7v4
✅ Preview: Canceled

[Deployment for af48c7d canceled]

@vercel vercel bot temporarily deployed to Preview – vault January 26, 2021 15:08 Inactive
@vercel vercel bot temporarily deployed to Preview – vault-storybook January 26, 2021 15:08 Inactive
@ianferguson ianferguson changed the title Instrument all connection pool semaphores Instrument all connection pool PermitPools Jan 26, 2021
@vercel vercel bot temporarily deployed to Preview – vault-storybook January 26, 2021 15:18 Inactive
@vercel vercel bot temporarily deployed to Preview – vault January 26, 2021 15:18 Inactive
@ianferguson
Copy link
Contributor Author

TestRaft_SnapshotAPI_RekeyRotate_Backward/rotate - github.com/hashicorp/vault/vault/external_tests/raft is failing in test-go-race, but I don't think that failure would be related to any changes in this PR

@brandoncole
Copy link

It would be great to get some 👀 on this and merged in, this would be tremendously helpful!

@ianferguson ianferguson force-pushed the ianferguson/storage_semaphore_gauges branch from af48c7d to 2b2866e Compare February 6, 2021 15:20
@ianferguson
Copy link
Contributor Author

this has many merge conflicts so I'm going to close it, but I'd be happy to rebase this instrumentation PR if y'all are interested in accepting it in the future

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants