
When database plugin Clean is called, close connections in goroutines #15923

Closed
swenson wants to merge 7 commits from the db-close branch

Conversation

@swenson (Contributor) commented Jun 9, 2022

We are seeing long pre-seal times when nodes are being shut down. The delay
appears to come from releasing the global stateLock, which in turn is held up
by cleanup of the builtin database plugin.

The database plugin `Clean()` eventually calls, synchronously, `Close()`
on every database instance that has been loaded. The database instance
`Close()` calls can take a long time to complete (e.g.,
[mongo](https://github.com/hashicorp/vault/blob/v1.10.3/plugins/database/mongodb/connection_producer.go#L132)
can take up to a minute to close its client before it returns).

So, instead, we change the `Clean()` method to close all of the database
instances in goroutines so they are closed more quickly and don't
block the release of the global stateLock. This should be safe because
`Clean()` is only called when the plugin is being unloaded, so
the connections will no longer be used.

In addition, this adds metrics tracking how many databases of each type
are in use by the builtin database plugin.
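
For illustration, here is a minimal sketch of the approach described above; the type and field names are stand-ins, not the actual Vault code:

```go
package example

import "sync"

// dbPluginInstance stands in for the wrapper around a loaded database plugin.
type dbPluginInstance struct{}

// Close can be slow for some drivers (e.g. mongo, up to a minute).
func (i *dbPluginInstance) Close() error { return nil }

type databaseBackend struct {
	connLock    sync.Mutex
	connections map[string]*dbPluginInstance
}

// clean swaps out the connection map under the lock, then closes each
// instance in its own goroutine so the call returns quickly instead of
// blocking on slow drivers.
func (b *databaseBackend) clean() {
	b.connLock.Lock()
	connectionsCopy := b.connections
	b.connections = make(map[string]*dbPluginInstance)
	b.connLock.Unlock()

	for _, db := range connectionsCopy {
		go func(db *dbPluginInstance) {
			// Safe to do asynchronously: the plugin is being unloaded,
			// so nothing will use this connection again.
			db.Close()
		}(db)
	}
}
```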

@swenson requested review from calvn, ncabatoff, austingebauer and a team June 9, 2022 21:22
```go
	for _, db := range connectionsCopy {
		go func(db *dbPluginInstance) {
			b.addConnectionsCounter(db.database, -1)
			db.Close()
```
Contributor
For the case of shutdowns, since we're no longer waiting for Close to return could this lead to dangling connections on the database side since the vault process will exit before this finishes?

Contributor Author

Hmm. Theoretically, the process exiting should close all of the associated sockets that weren't already cleanly closed, which should terminate all of the connections.

```diff
@@ -36,8 +36,7 @@ func (m *mockNewDatabase) DeleteUser(ctx context.Context, req v5.DeleteUserReque
 }

 func (m *mockNewDatabase) Type() (string, error) {
-	args := m.Called()
-	return args.String(0), args.Error(1)
+	return "mock", nil
```
Contributor

Is there a reason why we're hardcoding the return here?

I don't think we're currently doing this, but it wouldn't allow us to mock errors from this method in tests if we needed to, via something like `mockDB.On("Type").Return("", errors.New("error from type"))`.
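
For illustration, a standalone sketch of the kind of error injection being described; it only works while `Type()` delegates to `m.Called()`, which is exactly what the hardcoded return removes:

```go
package example

import (
	"errors"
	"testing"

	"github.com/stretchr/testify/mock"
)

// mockDatabase is a standalone stand-in, not the actual mockNewDatabase.
type mockDatabase struct {
	mock.Mock
}

// Type delegates to the testify mock, so each test can choose its return values.
func (m *mockDatabase) Type() (string, error) {
	args := m.Called()
	return args.String(0), args.Error(1)
}

func TestTypeErrorInjection(t *testing.T) {
	mockDB := &mockDatabase{}
	mockDB.On("Type").Return("", errors.New("error from type"))

	if _, err := mockDB.Type(); err == nil {
		t.Fatal("expected Type() to return the injected error")
	}
	mockDB.AssertExpectations(t)
}
```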

Contributor Author

If we don't, we have to add `mockDB.On("Type")` to nearly every test or the tests will panic, since Type() is used in metrics collection.

Since this isn't used for much else that I saw, it seemed safer to just hardcode it here.

```diff
@@ -308,6 +327,7 @@ func (b *databaseBackend) clearConnection(name string) error {
 	if ok {
 		// Ignore error here since the database client is always killed
 		db.Close()
+		b.addConnectionsCounter(db.database, -1)
```
Collaborator

Technically, counters should only ever increase; I think you want a gauge instead.

Contributor Author

Hmm. I thought about using a Gauge, but then we'll have to do some synchronization and use atomics. I was hoping to keep it simple.

But you're right. I realize when I've done this in the past, I've used two counters -- one for increments and one for decrements, and had Datadog do the diff, or just used a gauge.

I'll switch to gauges.
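
For reference, a minimal sketch of the counter-vs-gauge distinction with armon/go-metrics; the key names here are illustrative, not the names the PR ends up emitting:

```go
package example

import "github.com/armon/go-metrics"

// recordOpen uses a counter: it is only ever incremented, and the metrics
// store sums the stream of increments.
func recordOpen(dbType string) {
	metrics.IncrCounter([]string{"database", "connections", "opened", dbType}, 1)
}

// recordCurrent uses a gauge: it reports the current value, so it can move
// both up and down as connections open and close.
func recordCurrent(dbType string, open float32) {
	metrics.SetGauge([]string{"database", "connections", "count", dbType}, open)
}
```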

```go
// metrics collection
var (
	gaugeSync = sync.Mutex{}
	gauges    = map[string]*int32{}
```
Collaborator

I'm always discomforted when I see non-readonly globals; can we pull this into databaseBackend instead? I'm also not sure we need a generic gauge manager: why not just have an atomic numConnections field in databaseBackend? That way we won't have to worry about the lock protecting the map and any possible unpleasantness it might cause us.

Contributor Author

A single numConnections wouldn't allow us to see the different types of databases though.

I wanted to avoid putting these into databaseBackend because we might have multiple databaseBackends running, and we probably would like to aggregate these gauges across all of them. With gauges (instead of counters), I don't think that armon/go-metrics or statsd will automatically aggregate them for us (whereas with counters, I believe statsd at least would).

Collaborator

> A single numConnections wouldn't allow us to see the different types of databases though.

Good point.

> I wanted to avoid putting these into databaseBackend because we might have multiple databaseBackends running, and we probably would like to aggregate these gauges across all of them

Maybe? I'm not sure. I don't have a lot of experience with the database backend. Is it common to run more than one instead of just using multiple configs? I guess on the one hand, as a user, I'd rather not have the values pre-aggregated, since my metrics store can sum the individual series together but it can't break up a value pre-aggregated by Vault. On the other hand, if people might have 100s or 1000s of db mounts, I don't want us to break telemetry by creating too many series.

Contributor Author

Ideally, I would like to add a tag with the instance of the database backend -- maybe the mount path? But I don't see any way to get that. We could tag by the BackendUUID though.
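
As a hypothetical sketch of that idea with armon/go-metrics labels (the gauge name mirrors the `secrets.database.backend.connections.count` metric that shows up later in the test output; the `backend_uuid` label is only the suggestion floated here, not something this PR implements):

```go
package example

import "github.com/armon/go-metrics"

// emitConnectionGauges is hypothetical: one gauge per database type, tagged
// with the owning backend's UUID so series from multiple database backends
// can be told apart. backendUUID and counts are assumed inputs.
func emitConnectionGauges(backendUUID string, counts map[string]float32) {
	for dbType, n := range counts {
		metrics.SetGaugeWithLabels(
			[]string{"secrets", "database", "backend", "connections", "count"},
			n,
			[]metrics.Label{
				{Name: "type", Value: dbType},
				{Name: "backend_uuid", Value: backendUUID},
			},
		)
	}
}
```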

Christopher Swenson and others added 6 commits June 10, 2022 13:42
Use a semaphore to attempt to wait for the backend plugins to finish,
but don't wait too long.

We can't (easily) use a WaitGroup because it doesn't allow us to pass a
context or specify a timeout.

```go
	// we will try to wait for all the connections to close
	// ... but not too long
	sem := semaphore.NewWeighted(int64(len(connectionsCopy)))
```
Contributor Author

This seems like a useful pattern that I have seen messed up a lot, with a lot of bad answers on the internet; I'm tempted to move this out to a library, since I don't see one already made (what I want is something like a "TimedWaitGroup" or "WaitGroupWithTimeout").
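
For reference, a self-contained sketch of this semaphore-as-timed-wait-group pattern, assuming a simple closer interface (names are illustrative, not the PR's actual code):

```go
package example

import (
	"context"
	"time"

	"golang.org/x/sync/semaphore"
)

// closer stands in for anything with a potentially slow Close method.
type closer interface{ Close() }

// closeAllWithTimeout closes every connection in its own goroutine, then
// waits for them to finish, but never longer than the given timeout.
func closeAllWithTimeout(conns []closer, timeout time.Duration) {
	sem := semaphore.NewWeighted(int64(len(conns)))

	for _, c := range conns {
		// Never blocks: the total weight acquired here never exceeds len(conns).
		_ = sem.Acquire(context.Background(), 1)
		go func(c closer) {
			defer sem.Release(1)
			c.Close()
		}(c)
	}

	// Acquiring the full weight only succeeds once every goroutine has
	// released its unit; the context bounds how long we are willing to wait.
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	_ = sem.Acquire(ctx, int64(len(conns)))
}
```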

swenson pushed a commit that referenced this pull request Jun 10, 2022
Following up from discussions in #15923 and #15933, I wanted to split
out a separate PR that drastically reduced the complexity of the use of
the databaseBackend lock. We no longer need it at all for the
`credRotationQueue`, and we can move it to be solely used in a few,
small connections map management functions.
@swenson (Contributor Author) commented Jun 10, 2022

#15944 might be a better, simpler version of this. If it is approved and merged, I can close this and perhaps add a different PR for just adding metrics, the semaphore/waitgroup, or both.

swenson added a commit that referenced this pull request Jun 17, 2022

Cleanup and simplify lock usage in database plugin

Co-authored-by: Calvin Leung Huang <[email protected]>
@swenson (Contributor Author) commented Jun 17, 2022

I'm going to close this and rework it for just metrics, I think. Thanks everyone!

@swenson closed this Jun 17, 2022
@swenson deleted the db-close branch June 17, 2022 17:22
swenson pushed a commit that referenced this pull request Jun 17, 2022
swenson added a commit that referenced this pull request Jun 27, 2022
Add database plugin metrics around connections

This is a replacement for #15923 that takes into account recent lock
cleanup.

I went ahead and added back in the hanging plugin test, which I meant to
add in #15944 but forgot.

I tested this by spinning up a statsd sink in the tests and verifying I
got a stream of metrics:

```
$ nc -u -l 8125 | grep backend
test.swenson-Q9Q0L72D39.secrets.database.backend.connections.count.pgx.5.:1.000000|g
test.swenson-Q9Q0L72D39.secrets.database.backend.connections.count.pgx.5.:0.000000|g
test.swenson-Q9Q0L72D39.secrets.database.backend.connections.count.pgx.5.:1.000000|g
test.swenson-Q9Q0L72D39.secrets.database.backend.connections.count.pgx.5.:0.000000|g
```

We have to rework the shared gauge code to work without a full
`ClusterMetricSink`, since we don't have access to the core metrics from
within a plugin.

This only reports metrics every 10 minutes by default, but it solves
some problems we would have had with the gauge values becoming stale and
needing to be re-sent.

Co-authored-by: Tom Proctor <[email protected]>
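
For reference, a rough sketch of how a statsd sink like the one described above could be wired up with armon/go-metrics in a test; the address and service name are assumptions, not the actual test code:

```go
package example

import (
	"testing"

	"github.com/armon/go-metrics"
)

// startTestMetrics points the global go-metrics sink at a local statsd
// address, so anything emitted via metrics.SetGauge / metrics.IncrCounter can
// be observed with something like `nc -u -l 8125`.
func startTestMetrics(t *testing.T) {
	sink, err := metrics.NewStatsdSink("127.0.0.1:8125")
	if err != nil {
		t.Fatalf("failed to create statsd sink: %v", err)
	}
	if _, err := metrics.NewGlobal(metrics.DefaultConfig("test"), sink); err != nil {
		t.Fatalf("failed to configure metrics: %v", err)
	}
}
```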