
When database plugin Clean is called, close connections in goroutines #15923

Closed
swenson wants to merge 7 commits from the db-close branch

Conversation

@swenson (Contributor) commented Jun 9, 2022

We are seeing long pre-seal times when nodes are being shut down. The delay
appears to come from releasing the global stateLock, which in turn is held up
by cleanup of the builtin database plugin.

The database plugin `Clean()` eventually calls, synchronously, `Close()`
on every database instance that has been loaded. The database instance
`Close()` calls can take a long time to complete (e.g.,
[mongo](https://github.com/hashicorp/vault/blob/v1.10.3/plugins/database/mongodb/connection_producer.go#L132)
can take up to a minute to close its client before it returns).

So, instead, we change the `Clean()` method to close all of the database
instances in goroutines so they are closed more quickly and don't
block the release of the global stateLock. This should be safe because
`Clean()` is only called when the plugin is being unloaded, so
the connections will no longer be used.

In addition, this adds metrics tracking how many databases of each type
are in use by the builtin database plugin.
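
For illustration, here is a minimal sketch of the approach described above; the type and field names are stand-ins, not the actual Vault code:

```go
package example

import "sync"

// dbPluginInstance stands in for the wrapper around a loaded database plugin.
type dbPluginInstance struct{}

// Close can be slow for some drivers (e.g. mongo, up to a minute).
func (i *dbPluginInstance) Close() error { return nil }

type databaseBackend struct {
	connLock    sync.Mutex
	connections map[string]*dbPluginInstance
}

// clean swaps out the connection map under the lock, then closes each
// instance in its own goroutine so the call returns quickly instead of
// blocking on slow drivers.
func (b *databaseBackend) clean() {
	b.connLock.Lock()
	connectionsCopy := b.connections
	b.connections = make(map[string]*dbPluginInstance)
	b.connLock.Unlock()

	for _, db := range connectionsCopy {
		go func(db *dbPluginInstance) {
			// Safe to do asynchronously: the plugin is being unloaded,
			// so nothing will use this connection again.
			db.Close()
		}(db)
	}
}
```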

@swenson requested review from calvn, ncabatoff, austingebauer and a team June 9, 2022 21:22
```go
	for _, db := range connectionsCopy {
		go func(db *dbPluginInstance) {
			b.addConnectionsCounter(db.database, -1)
			db.Close()
```
Contributor
For the case of shutdowns, since we're no longer waiting for Close to return could this lead to dangling connections on the database side since the vault process will exit before this finishes?

Contributor Author

Hmm. Theoretically, the process exiting should close all of the associated sockets that weren't already cleanly closed, which should terminate all of the connections.

```diff
@@ -36,8 +36,7 @@ func (m *mockNewDatabase) DeleteUser(ctx context.Context, req v5.DeleteUserReque
 }

 func (m *mockNewDatabase) Type() (string, error) {
-	args := m.Called()
-	return args.String(0), args.Error(1)
+	return "mock", nil
```
Contributor

Is there a reason why we're hardcoding the return here?

I don't think we're currently doing this, but it wouldn't allow us to mock errors from this method in tests if we needed to, via something like `mockDB.On("Type").Return("", errors.New("error from type"))`.
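
For illustration, a standalone sketch of the kind of error injection being described; it only works while `Type()` delegates to `m.Called()`, which is exactly what the hardcoded return removes:

```go
package example

import (
	"errors"
	"testing"

	"github.com/stretchr/testify/mock"
)

// mockDatabase is a standalone stand-in, not the actual mockNewDatabase.
type mockDatabase struct {
	mock.Mock
}

// Type delegates to the testify mock, so each test can choose its return values.
func (m *mockDatabase) Type() (string, error) {
	args := m.Called()
	return args.String(0), args.Error(1)
}

func TestTypeErrorInjection(t *testing.T) {
	mockDB := &mockDatabase{}
	mockDB.On("Type").Return("", errors.New("error from type"))

	if _, err := mockDB.Type(); err == nil {
		t.Fatal("expected Type() to return the injected error")
	}
	mockDB.AssertExpectations(t)
}
```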

Contributor Author

If we don't, we have to add `mockDB.On("Type")` to nearly every test or the tests will panic, since Type() is used in metrics collection.

Since this isn't used for much else that I saw, it seemed safer to just hardcode it here.

```diff
@@ -308,6 +327,7 @@ func (b *databaseBackend) clearConnection(name string) error {
 	if ok {
 		// Ignore error here since the database client is always killed
 		db.Close()
+		b.addConnectionsCounter(db.database, -1)
```
Collaborator

Technically, counters should only ever increase; I think you want a gauge instead.

Contributor Author

Hmm. I thought about using a Gauge, but then we'll have to do some synchronization and use atomics. I was hoping to keep it simple.

But you're right. I realize when I've done this in the past, I've used two counters -- one for increments and one for decrements, and had Datadog do the diff, or just used a gauge.

I'll switch to gauges.
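
For reference, a minimal sketch of the counter-vs-gauge distinction with armon/go-metrics; the key names here are illustrative, not the names the PR ends up emitting:

```go
package example

import "github.com/armon/go-metrics"

// recordOpen uses a counter: it is only ever incremented, and the metrics
// store sums the stream of increments.
func recordOpen(dbType string) {
	metrics.IncrCounter([]string{"database", "connections", "opened", dbType}, 1)
}

// recordCurrent uses a gauge: it reports the current value, so it can move
// both up and down as connections open and close.
func recordCurrent(dbType string, open float32) {
	metrics.SetGauge([]string{"database", "connections", "count", dbType}, open)
}
```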

```go
// metrics collection
var (
	gaugeSync = sync.Mutex{}
	gauges    = map[string]*int32{}
```
Collaborator

I'm always discomforted when I see non-readonly globals; can we pull this into databaseBackend instead? I'm also not sure we need a generic gauge manager: why not just have an atomic numConnections field in databaseBackend? That way we won't have to worry about the lock protecting the map and any possible unpleasantness it might cause us.

Contributor Author

A single numConnections wouldn't allow us to see the different types of databases though.

I wanted to avoid putting these into databaseBackend because we might have multiple databaseBackends running, and we probably would like to aggregate these gauges across all of them. With gauges (instead of counters), I don't think that armon/go-metrics or statsd will automatically aggregate them for us (whereas with counters, I believe statsd at least would).

Collaborator

> A single numConnections wouldn't allow us to see the different types of databases though.

Good point.

> I wanted to avoid putting these into databaseBackend because we might have multiple databaseBackends running, and we probably would like to aggregate these gauges across all of them

Maybe? I'm not sure. I don't have a lot of experience with the database backend. Is it common to run more than one instead of just using multiple configs? I guess on the one hand, as a user, I'd rather not have the values pre-aggregated, since my metrics store can sum the individual series together but it can't break up a value pre-aggregated by Vault. On the other hand, if people might have 100s or 1000s of db mounts, I don't want us to break telemetry by creating too many series.

Contributor Author

Ideally, I would like to add a tag with the instance of the database backend -- maybe the mount path? But I don't see any way to get that. We could tag by the BackendUUID though.
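
As a hypothetical sketch of that idea with armon/go-metrics labels (the gauge name mirrors the `secrets.database.backend.connections.count` metric that shows up later in the test output; the `backend_uuid` label is only the suggestion floated here, not something this PR implements):

```go
package example

import "github.com/armon/go-metrics"

// emitConnectionGauges is hypothetical: one gauge per database type, tagged
// with the owning backend's UUID so series from multiple database backends
// can be told apart. backendUUID and counts are assumed inputs.
func emitConnectionGauges(backendUUID string, counts map[string]float32) {
	for dbType, n := range counts {
		metrics.SetGaugeWithLabels(
			[]string{"secrets", "database", "backend", "connections", "count"},
			n,
			[]metrics.Label{
				{Name: "type", Value: dbType},
				{Name: "backend_uuid", Value: backendUUID},
			},
		)
	}
}
```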

Christopher Swenson and others added 6 commits June 10, 2022 13:42
Use a semaphore to attempt to wait for the backend plugins to finish,
but don't wait too long.

We can't (easily) use a WaitGroup because it doesn't allow us to pass a
context or specify a timeout.

```go
	// we will try to wait for all the connections to close
	// ... but not too long
	sem := semaphore.NewWeighted(int64(len(connectionsCopy)))
```
Contributor Author

This seems like a useful pattern that I have seen messed up a lot, with a lot of bad answers on the internet; I'm tempted to move this out to a library, since I don't see one already made (what I want is something like a "TimedWaitGroup" or "WaitGroupWithTimeout").
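
For reference, a self-contained sketch of this semaphore-as-timed-wait-group pattern, assuming a simple closer interface (names are illustrative, not the PR's actual code):

```go
package example

import (
	"context"
	"time"

	"golang.org/x/sync/semaphore"
)

// closer stands in for anything with a potentially slow Close method.
type closer interface{ Close() }

// closeAllWithTimeout closes every connection in its own goroutine, then
// waits for them to finish, but never longer than the given timeout.
func closeAllWithTimeout(conns []closer, timeout time.Duration) {
	sem := semaphore.NewWeighted(int64(len(conns)))

	for _, c := range conns {
		// Never blocks: the total weight acquired here never exceeds len(conns).
		_ = sem.Acquire(context.Background(), 1)
		go func(c closer) {
			defer sem.Release(1)
			c.Close()
		}(c)
	}

	// Acquiring the full weight only succeeds once every goroutine has
	// released its unit; the context bounds how long we are willing to wait.
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	_ = sem.Acquire(ctx, int64(len(conns)))
}
```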

swenson pushed a commit that referenced this pull request Jun 10, 2022
Following up from discussions in #15923 and #15933, I wanted to split
out a separate PR that drastically reduced the complexity of the use of
the databaseBackend lock. We no longer need it at all for the
`credRotationQueue`, and we can move it to be solely used in a few,
small connections map management functions.
@swenson (Contributor Author) commented Jun 10, 2022

#15944 might be a better, simpler version of this. If it is approved and merged, I can close this and perhaps add a different PR for just adding metrics, the semaphore/waitgroup, or both.

swenson added a commit that referenced this pull request Jun 17, 2022

Cleanup and simplify lock usage in database plugin

Co-authored-by: Calvin Leung Huang <[email protected]>
@swenson (Contributor Author) commented Jun 17, 2022

I'm going to close this and rework it for just metrics, I think. Thanks everyone!

@swenson closed this Jun 17, 2022
@swenson deleted the db-close branch June 17, 2022 17:22
swenson pushed a commit that referenced this pull request Jun 17, 2022
swenson added a commit that referenced this pull request Jun 27, 2022
Add database plugin metrics around connections

This is a replacement for #15923 that takes into account recent lock
cleanup.

I went ahead and added back in the hanging plugin test, which I meant to
add in #15944 but forgot.

I tested this by spinning up a statsd sink in the tests and verifying I
got a stream of metrics:

```
$ nc -u -l 8125 | grep backend
test.swenson-Q9Q0L72D39.secrets.database.backend.connections.count.pgx.5.:1.000000|g
test.swenson-Q9Q0L72D39.secrets.database.backend.connections.count.pgx.5.:0.000000|g
test.swenson-Q9Q0L72D39.secrets.database.backend.connections.count.pgx.5.:1.000000|g
test.swenson-Q9Q0L72D39.secrets.database.backend.connections.count.pgx.5.:0.000000|g
```

We have to rework the shared gauge code to work without a full
`ClusterMetricSink`, since we don't have access to the core metrics from
within a plugin.

This only reports metrics every 10 minutes by default, but it solves
some problems we would have had with the gauge values becoming stale and
needing to be re-sent.

Co-authored-by: Tom Proctor <[email protected]>
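
For reference, a rough sketch of how a statsd sink like the one described above could be wired up with armon/go-metrics in a test; the address and service name are assumptions, not the actual test code:

```go
package example

import (
	"testing"

	"github.com/armon/go-metrics"
)

// startTestMetrics points the global go-metrics sink at a local statsd
// address, so anything emitted via metrics.SetGauge / metrics.IncrCounter can
// be observed with something like `nc -u -l 8125`.
func startTestMetrics(t *testing.T) {
	sink, err := metrics.NewStatsdSink("127.0.0.1:8125")
	if err != nil {
		t.Fatalf("failed to create statsd sink: %v", err)
	}
	if _, err := metrics.NewGlobal(metrics.DefaultConfig("test"), sink); err != nil {
		t.Fatalf("failed to configure metrics: %v", err)
	}
}
```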