
POTENTIAL DEADLOCK: When the rate limit quotas setup takes more time to load the quota config and rules #21338

Closed
Ajaykumarkolipaka opened this issue Jun 19, 2023 · 3 comments · Fixed by #21342

Comments

Ajaykumarkolipaka commented Jun 19, 2023

Describe the bug
We tried to apply rate limit quotas to approximately 30K paths. When Vault restarts, its post-unseal process sets up the rate limit quota config and rules by acquiring the quota manager lock (in quotas.(*Manager).Setup), which with this many rules takes more than 30 seconds. Because Vault is already unsealed, other goroutines are serving API calls such as sys/health and sys/metrics, and this is leading to a deadlock situation, as Vault checks whether the request path is in the exempt list by acquiring the same lock again (in quotas.(*Manager).RateLimitPathExempt).
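A minimal sketch of the contention (hypothetical types and names, not Vault's actual code): the setup goroutine holds a write lock while it loads a large rule set, and request-handling goroutines block on a read lock of the same mutex until setup finishes, which is what the detector flags in the trace below.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// quotaManager is a simplified, hypothetical stand-in for the quota manager:
// Setup holds the write lock while loading rules; pathExempt needs a read lock
// on the same mutex, so every request blocks until setup finishes.
type quotaManager struct {
	lock    sync.RWMutex
	exempts map[string]bool
}

func (m *quotaManager) Setup() {
	m.lock.Lock()
	defer m.lock.Unlock()
	m.exempts = map[string]bool{"sys/health": true, "sys/metrics": true}
	// Stand-in for loading ~30K quota rules; in the report below this step
	// held the lock for longer than the detector's 30s threshold.
	time.Sleep(2 * time.Second)
}

func (m *quotaManager) pathExempt(path string) bool {
	m.lock.RLock()
	defer m.lock.RUnlock()
	return m.exempts[path]
}

func main() {
	m := &quotaManager{}
	go m.Setup() // post-unseal setup starts in the background

	time.Sleep(100 * time.Millisecond)
	// A request such as sys/health arrives while setup still holds the lock;
	// it simply waits, which a deadlock-detecting mutex reports once the wait
	// exceeds its threshold, even though no cycle of locks exists.
	fmt.Println("exempt:", m.pathExempt("sys/health"))
}
```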

To Reproduce
Steps to reproduce the behavior:

  1. Add a time.Sleep(2 * time.Minute) statement where the quota setup acquires its lock (quotas.(*Manager).Setup)
  2. Alternatively, to skip step 1, create more than 30K paths and apply rate limit quotas to all of them
  3. Restart Vault
  4. See the error:
```
POTENTIAL DEADLOCK:
Previous place where the lock was grabbed
goroutine 138 lock 0xc00190a618
/tmp/project/vault/quotas/quotas.go:1026 quotas.(*Manager).Setup ??? <<<<<
/tmp/project/vault/core.go:3272 vault.(*Core).setupQuotas ???
/tmp/project/vault/core.go:2332 vault.standardUnsealStrategy.unseal ???
/tmp/project/vault/core.go:2455 vault.(*Core).postUnseal ???
/tmp/project/vault/ha.go:659 vault.(*Core).waitForLeadership ???
/tmp/project/vault/ha.go:479 vault.(*Core).runStandby.func9 ???
/tmp/project/vendor/github.com/oklog/run/group.go:38 run.(*Group).Run.func1 ???

Have been trying to lock it again for more than 30s
goroutine 5159 lock 0xc00190a618
/tmp/project/vault/quotas/quotas.go:731 quotas.(*Manager).RateLimitPathExempt ??? <<<<<
/tmp/project/vault/core.go:3287 vault.(*Core).ApplyRateLimitQuota ???
/tmp/project/http/util.go:63 http.rateLimitQuotaWrapping.func1 ???
/usr/local/go/src/net/http/server.go:2122 http.HandlerFunc.ServeHTTP ???
/tmp/project/http/handler.go:440 http.wrapGenericHandler.func1 ???
/usr/local/go/src/net/http/server.go:2122 http.HandlerFunc.ServeHTTP ???
/tmp/project/vendor/github.com/hashicorp/go-cleanhttp/handlers.go:42 go-cleanhttp.PrintablePathCheckHandler.func1 ???
/usr/local/go/src/net/http/server.go:2122 http.HandlerFunc.ServeHTTP ???
/usr/local/go/src/net/http/server.go:2936 http.serverHandler.ServeHTTP ???
/usr/local/go/src/net/http/server.go:1995 http.(*conn).serve ???
```

Expected behavior
Vault should be able to restart without any issues and should load the rate limit quota config and rules in less time.

Environment:

  • Vault Server Version (retrieve with vault status): 1.13.1
  • Vault CLI Version (retrieve with vault version):
  • Server Operating System/Architecture: AWS EKS, with DynamoDB as the storage backend

Vault server configuration file(s):

```hcl
# Paste your Vault config here.
# Be sure to scrub any sensitive values
api_addr     = "<vault addr>:8200"
cluster_addr = "https://$(POD_IP_ADDR):8201"
log_level    = "trace"
ui           = true

seal "awskms" {
  region     = ""
  kms_key_id = ""
}

storage "dynamodb" {
  region       = ""
  table        = ""
  ha_enabled   = "true"
  max_parallel = "25"
}

listener "tcp" {
  address                         = "127.0.0.1:8200"
  max_request_duration            = "90s"
  http_read_timeout               = "30s"
  tls_disable_client_certs        = "false"
  tls_prefer_server_cipher_suites = "true"
  tls_min_version                 = "tls12"
  tls_cipher_suites               = ""
  tls_cert_file                   = ""
  tls_key_file                    = ""

  telemetry {
    prometheus_retention_time     = "1h"
    disable_hostname              = true
    enable_hostname_label         = true
    unauthenticated_metrics_access = true
  }
}

listener "tcp" {
  address                         = "$(POD_IP_ADDR):8200"
  max_request_duration            = "90s"
  http_read_timeout               = "30s"
  tls_disable_client_certs        = "false"
  tls_prefer_server_cipher_suites = "true"
  tls_min_version                 = "tls12"
  tls_cipher_suites               = ""
  tls_cert_file                   = ""
  tls_key_file                    = ""
  proxy_protocol_behavior         = "use_always"
}

plugin_directory = "/etc/vault/plugins"

telemetry {
  prometheus_retention_time     = "1h"
  disable_hostname              = true
  enable_hostname_label         = true
  unauthenticated_metrics_access = true
}
```


ncabatoff (Collaborator) commented

Hi @Ajaykumarkolipaka,

I agree with most of your analysis, but I don't think this is quite right:

> and this is leading to a deadlock situation

The way I would phrase it is: if you have a huge number of quota rules, then you may get a POTENTIAL DEADLOCK log message. We're using a library for some of our mutexes that tries to detect deadlocks, and it uses a heuristic whereby if someone has been trying to grab the lock for 30s or longer, it thinks that might be a deadlock, and so it logs this message. In fact there is no deadlock here, it's just that you have a very high number of quota rules, which takes a long time to load, which fools the library into thinking there might be a deadlock.
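For reference, the report format in the log above matches the output of github.com/sasha-s/go-deadlock. Assuming that is the detector in use here (an assumption, not confirmed in this thread), a minimal sketch of how its timeout heuristic produces the message without a real deadlock; the timeout is shortened so the demo finishes quickly:

```go
package main

import (
	"fmt"
	"time"

	deadlock "github.com/sasha-s/go-deadlock"
)

func main() {
	// The detector's threshold: a goroutine blocked on a deadlock.Mutex/RWMutex
	// for longer than this triggers a "POTENTIAL DEADLOCK" report, whether or
	// not a real deadlock exists. The report above used the 30s default.
	deadlock.Opts.DeadlockTimeout = 2 * time.Second
	// By default the detector terminates the process after reporting; keep
	// running for this demo so the slow lock holder can finish.
	deadlock.Opts.OnPotentialDeadlock = func() {}

	var mu deadlock.RWMutex // drop-in replacement for sync.RWMutex

	mu.Lock() // stands in for the long-running quota setup holding the write lock
	done := make(chan struct{})
	go func() {
		mu.RLock() // stands in for the exempt-path check; blocked past the timeout -> report
		mu.RUnlock()
		close(done)
	}()

	time.Sleep(3 * time.Second) // "setup" outlasts the detector timeout
	mu.Unlock()
	<-done
	fmt.Println("no real deadlock: the reader got the lock once setup finished")
}
```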

My read of the situation is that everything is working as designed. Why do you have so many quota rules? Can you simplify them, e.g. by setting them at the mount level instead of the path level?
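For context on the mount-level suggestion: a single rate limit quota whose path targets the whole mount replaces the per-path rules with one rule. A sketch using the Vault Go API client against the sys/quotas/rate-limit endpoint; the quota name, mount, and rate are illustrative values, not taken from this issue:

```go
package main

import (
	"log"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	// Client config is read from VAULT_ADDR / VAULT_TOKEN.
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// One rate limit quota scoped to a whole mount instead of ~30K path-level rules.
	_, err = client.Logical().Write("sys/quotas/rate-limit/global-secret", map[string]interface{}{
		"path": "secret/", // mount-level scope: applies to every path under the mount
		"rate": 100,       // allowed requests per second against this mount
	})
	if err != nil {
		log.Fatal(err)
	}
}
```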

Ajaykumarkolipaka (Author) commented Jun 19, 2023

Hi @ncabatoff, thank you for the response. I have a query below; can you please help me understand it?

If the POTENTIAL DEADLOCK is just a log message from the library, then Vault should at least be up and running after a couple of minutes, but in my case Vault is crashing. Is this expected or am I missing something here?
panic: Failed to execute /usr/local/bin/docker-entrypoint.sh /usr/local/bin/docker-entrypoint.sh server: err=exit status 2

On a side note:

> by setting them at the mount level instead of the path level

Yes, configuring the rate limit quotas at the mount level is a good idea when we have a huge number of paths.

ncabatoff (Collaborator) commented

> Is this expected or am I missing something here?

No, I was forgetting that that's a side-effect of the deadlock detector's default behaviour. That's probably a bug we should fix; I'll look into that.
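For reference on the crash: assuming the detector is github.com/sasha-s/go-deadlock (again an assumption, not confirmed in this thread), its default OnPotentialDeadlock handler terminates the process after printing the report, which would line up with the `err=exit status 2` above. A sketch of overriding that default so the report is log-only:

```go
package main

import (
	"log"
	"time"

	deadlock "github.com/sasha-s/go-deadlock"
)

func main() {
	// Assumption: the detector is sasha-s/go-deadlock. Its default
	// OnPotentialDeadlock handler exits the process once a potential deadlock
	// is reported; overriding it keeps a slow-but-healthy server alive.
	deadlock.Opts.OnPotentialDeadlock = func() {
		log.Println("potential deadlock reported; continuing instead of exiting")
	}
	// Optionally raise the threshold if long lock holds are expected at startup.
	deadlock.Opts.DeadlockTimeout = 2 * time.Minute
}
```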
