When database plugin Clean is called, close connections in goroutines #15923
Conversation
builtin/logical/database/backend.go
Outdated
for _, db := range connectionsCopy {
	go func(db *dbPluginInstance) {
		b.addConnectionsCounter(db.database, -1)
		db.Close()
For the case of shutdowns, since we're no longer waiting for Close to return, could this lead to dangling connections on the database side, given that the vault process will exit before this finishes?
Hmm. I mean, theoretically the process closing should close all of the associated sockets if they hadn't already been cleanly closed, which should terminate all of the connections.
@@ -36,8 +36,7 @@ func (m *mockNewDatabase) DeleteUser(ctx context.Context, req v5.DeleteUserReque
 }

 func (m *mockNewDatabase) Type() (string, error) {
-	args := m.Called()
-	return args.String(0), args.Error(1)
+	return "mock", nil
Is there a reason why we're hardcoding the return here?
I don't think we're currently doing this, but it wouldn't allow us to mock errors from this method in tests if we needed to, via something like `mockDB.On("Type").Return("", errors.New("error from type"))`.
If we don't, we have to add `mockDB.On("Type")` to nearly every method or the tests will panic, since we use it in metrics collection. Since this isn't used for much else that I saw, it seemed safer to just hardcode it here.
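For context, a minimal, self-contained sketch of the trade-off being discussed, using testify's mock package; the struct here and the `TypeFromExpectations` name are illustrative stand-ins rather than the actual test code:

```go
package main

import (
	"errors"
	"fmt"

	"github.com/stretchr/testify/mock"
)

// mockNewDatabase is a simplified stand-in for the test mock discussed above.
type mockNewDatabase struct {
	mock.Mock
}

// Hardcoded variant: always succeeds, so tests never have to register an
// expectation for it (but can also never make it return an error).
func (m *mockNewDatabase) Type() (string, error) {
	return "mock", nil
}

// Expectation-driven variant: lets a test inject errors, but panics in any
// test that forgets to register an expectation for it.
func (m *mockNewDatabase) TypeFromExpectations() (string, error) {
	args := m.Called()
	return args.String(0), args.Error(1)
}

func main() {
	m := &mockNewDatabase{}
	m.On("TypeFromExpectations").Return("", errors.New("error from type"))

	_, err := m.TypeFromExpectations()
	fmt.Println(err) // error from type
}
```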
@@ -308,6 +327,7 @@ func (b *databaseBackend) clearConnection(name string) error {
 	if ok {
 		// Ignore error here since the database client is always killed
 		db.Close()
+		b.addConnectionsCounter(db.database, -1)
Technically, counters should only ever increase; I think you want a gauge instead.
Hmm. I thought about using a gauge, but then we'll have to do some synchronization and use atomics, and I was hoping to keep it simple.
But you're right. When I've done this in the past, I've used two counters (one for increments and one for decrements) and had Datadog do the diff, or just used a gauge.
I'll switch to gauges.
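A rough sketch of the difference with armon/go-metrics; the metric names, the package-level counter, and the helper functions are illustrative assumptions rather than the PR's actual code:

```go
package main

import (
	"sync/atomic"

	metrics "github.com/armon/go-metrics"
)

// numConnections is an illustrative package-level value; where this state
// should actually live is discussed further below.
var numConnections int32

func connectionOpened(dbType string) {
	// Counter style: emit a monotonically increasing total and let the
	// metrics store compute rates or diffs from it.
	metrics.IncrCounter([]string{"database", dbType, "connections", "opened"}, 1)

	// Gauge style: report the current level directly, so it can go both
	// up and down.
	n := atomic.AddInt32(&numConnections, 1)
	metrics.SetGauge([]string{"database", dbType, "connections", "count"}, float32(n))
}

func connectionClosed(dbType string) {
	n := atomic.AddInt32(&numConnections, -1)
	metrics.SetGauge([]string{"database", dbType, "connections", "count"}, float32(n))
}

func main() {
	connectionOpened("postgresql")
	connectionClosed("postgresql")
}
```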
// metrics collection
var (
	gaugeSync = sync.Mutex{}
	gauges    = map[string]*int32{}
I'm always discomforted when I see non-readonly globals; can we pull this into `databaseBackend` instead? I'm also not sure we need a generic gauge manager: why not just have an atomic `numConnections` field in `databaseBackend`? That way we won't have to worry about the lock to protect the map and any possible unpleasantness it might cause us.
A single `numConnections` wouldn't allow us to see the different types of databases though.

I wanted to avoid putting these into `databaseBackend` because we might have multiple `databaseBackend`s running, and we probably would like to aggregate these gauges across all of them. With gauges (instead of counters), I don't think that `armon/go-metrics` or `statsd` will automatically aggregate them for us (whereas with counters, I believe `statsd` at least would).
> A single numConnections wouldn't allow us to see the different types of databases though.

Good point.

> I wanted to avoid putting these into databaseBackend because we might have multiple databaseBackends running, and we probably would like to aggregate these gauges across all of them

Maybe? I'm not sure. I don't have a lot of experience with the database backend. Is it common to run more than one instead of just using multiple configs? I guess on the one hand, as a user, I'd rather not have the values pre-aggregated, since my metrics store can sum the individual series together but it can't break up a value pre-aggregated by Vault. On the other hand, if people might have 100s or 1000s of db mounts, I don't want us to break telemetry by creating too many series.
Ideally, I would like to add a tag with the instance of the database backend -- maybe the mount path? But I don't see any way to get that. We could tag by the BackendUUID though.
We are seeing long pre-seal times when nodes are being shut down, which seems to be delays in releasing the global stateLock, which further seems to be delayed by cleaning up the builtin database plugin.

The database plugin `Clean()` eventually calls, synchronously, `Close()` on every database instance that has been loaded. The database instance `Close()` calls can take a long time to complete (e.g., [mongo](https://github.com/hashicorp/vault/blob/v1.10.3/plugins/database/mongodb/connection_producer.go#L132) can take up to a minute to close its client before it will return).

So, instead, we change the `Clean()` method to close all of the database instances in goroutines so they will be closed more quickly and won't block the release of the global stateLock. This should be safe because `Clean()` will only be called when the plugin is being unloaded, and so the connections will no longer be used.

In addition, this adds metrics tracking how many of each type of database are being used by the builtin database plugin.
Co-authored-by: Calvin Leung Huang <[email protected]>
// we will try to wait for all the connections to close
// ... but not too long
sem := semaphore.NewWeighted(int64(len(connectionsCopy)))
This seems like a useful pattern that I have seen messed up a lot, with a lot of bad answers for it on the internet; I'm tempted to move this out to a library, since I don't see one already made (what I want is something like a "TimedWaitGroup" or "WaitGroupWithTimeout").
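For illustration, a self-contained sketch of that pattern (a wait-group-with-timeout built from a weighted semaphore, along the lines of the diff above); `closeAllWithTimeout` and its arguments are invented for the example and are not names from the PR:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/semaphore"
)

// closeAllWithTimeout runs each close function in its own goroutine and then
// waits for all of them to finish, but only up to the given timeout.
func closeAllWithTimeout(closers []func(), timeout time.Duration) error {
	n := int64(len(closers))
	sem := semaphore.NewWeighted(n)

	for _, closeFn := range closers {
		// Cannot block: total capacity equals len(closers).
		_ = sem.Acquire(context.Background(), 1)
		go func(closeFn func()) {
			defer sem.Release(1)
			closeFn()
		}(closeFn)
	}

	// Re-acquiring the full weight only succeeds once every goroutine has
	// released its slot; the context bounds how long we are willing to wait.
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return sem.Acquire(ctx, n)
}

func main() {
	closers := []func(){
		func() { time.Sleep(50 * time.Millisecond) },
		func() { time.Sleep(2 * time.Second) }, // a slow Close(), e.g. mongo
	}
	if err := closeAllWithTimeout(closers, 500*time.Millisecond); err != nil {
		fmt.Println("gave up waiting:", err) // context deadline exceeded
	}
}
```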
Following up from discussions in #15923 and #15933, I wanted to split out a separate PR that drastically reduced the complexity of the use of the databaseBackend lock. We no longer need it at all for the `credRotationQueue`, and we can move it to be solely used in a few small connection-map management functions.
#15944 might be a better, simpler version of this. If it is approved and merged, I can close this and perhaps add a different PR for just adding metrics, the semaphore/waitgroup, or both.
Cleanup and simplify lock usage in database plugin

Following up from discussions in #15923 and #15933, I wanted to split out a separate PR that drastically reduced the complexity of the use of the databaseBackend lock. We no longer need it at all for the `credRotationQueue`, and we can move it to be solely used in a few small connection-map management functions.

Co-authored-by: Calvin Leung Huang <[email protected]>
I'm going to close this and rework it for just metrics, I think. Thanks everyone!
This is a replacement for #15923 that takes into account recent lock cleanup. I went ahead and added back in the hanging plugin test, which I meant to add in #15944 but forgot.

I tested this by spinning up a statsd sink in the tests and verifying I got a stream of metrics:

```
$ nc -u -l 8125 | grep backend
test.swenson-Q9Q0L72D39.secrets.database.backend.connections.count.pgx.5.:1.000000|g
test.swenson-Q9Q0L72D39.secrets.database.backend.connections.count.pgx.5.:0.000000|g
test.swenson-Q9Q0L72D39.secrets.database.backend.connections.count.pgx.5.:1.000000|g
test.swenson-Q9Q0L72D39.secrets.database.backend.connections.count.pgx.5.:0.000000|g
```
Add database plugin metrics around connections

This is a replacement for #15923 that takes into account recent lock cleanup. I went ahead and added back in the hanging plugin test, which I meant to add in #15944 but forgot.

I tested this by spinning up a statsd sink in the tests and verifying I got a stream of metrics:

```
$ nc -u -l 8125 | grep backend
test.swenson-Q9Q0L72D39.secrets.database.backend.connections.count.pgx.5.:1.000000|g
test.swenson-Q9Q0L72D39.secrets.database.backend.connections.count.pgx.5.:0.000000|g
test.swenson-Q9Q0L72D39.secrets.database.backend.connections.count.pgx.5.:1.000000|g
test.swenson-Q9Q0L72D39.secrets.database.backend.connections.count.pgx.5.:0.000000|g
```

We have to rework the shared gauge code to work without a full `ClusterMetricSink`, since we don't have access to the core metrics from within a plugin. This only reports metrics every 10 minutes by default, but it solves some problems we would have had with the gauge values becoming stale and needing to be re-sent.

Co-authored-by: Tom Proctor <[email protected]>
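For a rough idea of what interval-based gauge reporting looks like, a minimal sketch; the function, metric key, and `getCounts` callback are illustrative assumptions, and the real code uses Vault's shared gauge helpers:

```go
package main

import (
	"context"
	"time"

	metrics "github.com/armon/go-metrics"
)

// reportConnectionGauges re-emits the current per-database-type connection
// counts on a fixed interval, so gauge values at the sink do not go stale.
// getCounts is a stand-in for however the backend exposes its counts.
func reportConnectionGauges(ctx context.Context, interval time.Duration, getCounts func() map[string]int32) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for dbType, n := range getCounts() {
				metrics.SetGauge(
					[]string{"secrets", "database", "backend", "connections", "count", dbType},
					float32(n),
				)
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	reportConnectionGauges(ctx, 200*time.Millisecond, func() map[string]int32 {
		return map[string]int32{"postgresql": 1}
	})
}
```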