goroutine leak during blocking queries #15010
Comments
The early returns of the function `watchMany()` leak the goroutines it spawns. When watchLimit (8192) is reached, it starts to watch the whole table instead of individual channels.
Just want to provide an update. We have reproduced this without using an artificially large index value:

1. We create a 10K-instance `large-service` with the tag `test-tag`.
2. We create a second, single-instance service `dummy-service`.
3. We issue the tag-filtered blocking query repeatedly, which produces very large goroutine counts.

Once the blocking query returns -- on a change to `large-service` or on a timeout of the blocking query -- the goroutine count returns to normal.

This matches what we see in our large production cluster. We have a job that does a /v1/health/service blocking query with a tag on a 4193-instance service with a 10m timeout. By the time the whole 10m elapses, we see the spikes up to 1M that @mechpen mentions. We enabled streaming for the nodes running the instances of the job issuing these queries and the problem went away.

Even though streaming helps with our current issue, we'd like to see this fixed, since we're worried about new queries being introduced to our clusters from nodes without streaming enabled, or through calls that do not support streaming. We are worried that if a lot of such calls were introduced in short order, it could kill our Consul cluster. A sketch of the query pattern our job uses follows.
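A minimal Go sketch of this kind of tag-filtered blocking query (the address, service name, tag, index, and wait values are placeholders; a real watcher would feed each response's `X-Consul-Index` back into the next request):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Blocking query against the health endpoint with a tag filter and a
	// 10m wait, mirroring the job described above. The index value here is
	// a placeholder; a real watcher would pass the X-Consul-Index returned
	// by the previous call.
	url := "http://localhost:8500/v1/health/service/large-service" +
		"?tag=test-tag&index=1&wait=10m"

	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println("X-Consul-Index:", resp.Header.Get("X-Consul-Index"))
	fmt.Println("payload bytes:", len(body))
}
```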
Overview of the Issue
We observed frequent spikes in the number of goroutines, from 40k to ~1M. We traced the root cause to a goroutine leak in go-memdb.
Reproduction Steps
Steps to reproduce this issue:

1. Create a service `large-service` with the tag `test-tag`.
2. Run `curl "localhost:8500/v1/catalog/service/large-service?tag=test-tag&index=99999999999"` to issue a blocking query against `large-service`.
3. The curl command should not return because the index is very large (expected).
4. The goroutine count spikes (not expected); one way to observe this is sketched below.
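One way to observe the spike from step 4 is to poll the agent's metrics endpoint while the query blocks. A sketch, assuming the agent reports the `consul.runtime.num_goroutines` gauge via /v1/agent/metrics:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// agentMetrics mirrors just the slice of the /v1/agent/metrics payload
// needed to read gauge values.
type agentMetrics struct {
	Gauges []struct {
		Name  string  `json:"Name"`
		Value float64 `json:"Value"`
	} `json:"Gauges"`
}

func main() {
	for {
		resp, err := http.Get("http://localhost:8500/v1/agent/metrics")
		if err != nil {
			panic(err)
		}
		var m agentMetrics
		err = json.NewDecoder(resp.Body).Decode(&m)
		resp.Body.Close()
		if err != nil {
			panic(err)
		}
		for _, g := range m.Gauges {
			if g.Name == "consul.runtime.num_goroutines" {
				fmt.Printf("%s goroutines=%.0f\n",
					time.Now().Format(time.RFC3339), g.Value)
			}
		}
		time.Sleep(5 * time.Second)
	}
}
```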
Cause analysis
The function `blockingQuery()` has a for loop that calls go-memdb's `WatchSet.WatchCtx()` to watch the service-related state. When any related state changes, `WatchCtx()` returns. If the service did not actually change, or the `MinQueryIndex` is large, the for loop starts a new iteration, so `WatchCtx()` can be called many times within a single `blockingQuery()` call.

The go-memdb `WatchSet.watchMany()` has a bug that leaks goroutines, which causes the Consul goroutine spikes during blocking queries. The leaked goroutines are only cleaned up when the blocking query returns, either on a service update or on a timeout.
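To make the leak concrete, here is a self-contained sketch of the pattern described above; it is a simplification, not go-memdb's actual code, and `watchChunk`, the chunk size, and the channel counts are illustrative. Each `watchMany` call fans out one goroutine per chunk of watch channels and returns as soon as any chunk fires; the goroutines for the other chunks stay parked on `ctx.Done()` until the whole blocking query's context is canceled, so every loop iteration leaks another batch:

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// watchChunk stands in for go-memdb's watchFew: it blocks until one of the
// chunk's channels fires or ctx is done. To keep the sketch short it waits
// only on the chunk's first channel rather than selecting over all of them.
func watchChunk(ctx context.Context, chunk []<-chan struct{}, trigger chan<- struct{}) {
	select {
	case <-chunk[0]:
		select {
		case trigger <- struct{}{}:
		default:
		}
	case <-ctx.Done():
	}
}

// watchMany reproduces the leaky fan-out pattern: one goroutine per chunk,
// returning as soon as any chunk fires. The goroutines for the remaining
// chunks stay parked on ctx.Done() until the whole ctx is canceled.
func watchMany(ctx context.Context, chans []<-chan struct{}) {
	trigger := make(chan struct{}, 1)
	const chunkSize = 32
	for i := 0; i < len(chans); i += chunkSize {
		end := i + chunkSize
		if end > len(chans) {
			end = len(chans)
		}
		go watchChunk(ctx, chans[i:end], trigger)
	}
	select {
	case <-trigger:
	case <-ctx.Done():
	}
}

func main() {
	// One ctx for the whole "blocking query", as in blockingQuery().
	ctx, cancel := context.WithCancel(context.Background())

	// 10,000 watch channels, standing in for the per-instance channels of a
	// large service. Only the first one ever fires.
	send := make([]chan struct{}, 10000)
	watched := make([]<-chan struct{}, 10000)
	for i := range send {
		send[i] = make(chan struct{})
		watched[i] = send[i]
	}

	// Each loop iteration models one pass of blockingQuery's for loop: an
	// unrelated change wakes the watch, the query index has not caught up,
	// so the loop calls watchMany again and leaks another batch.
	for iter := 1; iter <= 5; iter++ {
		go func() { send[0] <- struct{}{} }()
		watchMany(ctx, watched)
		time.Sleep(100 * time.Millisecond) // let the woken goroutines exit
		fmt.Printf("iteration %d: %d goroutines\n", iter, runtime.NumGoroutine())
	}

	cancel() // ends the "blocking query"; the leaked goroutines now exit
	time.Sleep(100 * time.Millisecond)
	fmt.Println("after cancel:", runtime.NumGoroutine(), "goroutines")
}
```

Running this prints a goroutine count that climbs by roughly one goroutine per chunk on each iteration and falls back to normal after `cancel()`, matching the spikes and cleanup described above.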
Consul info for both Client and Server
Found in consul 1.11.7-ent.