Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: implement a background task for temp object cleanup #45669

Merged

Conversation

otan
Copy link
Contributor

@otan otan commented Mar 3, 2020

Resolves #45666

This PR introduces a TempObjectCleaner, which is a background task
that runs periodically on every node, but will only be executed by the
node that has the meta1 leaseholder. This task will cleanup dangling
temporary schemas that could not have been cleaned up by the connection.
The job also comes with logging and metrics. A simple test mocking the
ticking mechanism has also been included.

Also refactored the existing temporary schema deletion to use ExecEx,
which overrides the SearchPath / User executing the deletion at a lower
layer rather than using SetSessionExecutor.

Release note (sql change): Introduced a temporary table cleanup job that
runs once every periodically per cluster that will remove any temporary
schemas and their related objects that did not get removed cleanly when
the connection closed. This period can be changed by the cluster setting
sql.temp_object_cleaner.cleanup_interval, which defaults to 30min-ly.

@otan otan requested review from jordanlewis and a team March 3, 2020 21:34
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@otan otan force-pushed the temp_schema_cleanup_boogaloo branch from d6a8b20 to 84b98c2 Compare March 3, 2020 23:04
@jordanlewis jordanlewis mentioned this pull request Mar 4, 2020
83 tasks
@otan otan force-pushed the temp_schema_cleanup_boogaloo branch 6 times, most recently from 8c5e449 to 15cbf16 Compare March 4, 2020 19:56
@otan
Copy link
Contributor Author

otan commented Mar 6, 2020

@jordanlewis RFAL if you have time :)

Copy link
Member

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly :lgtm: though having a stamp from someone who proposed the meta1 thing would be good!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @otan)


pkg/sql/temporary_schema.go, line 409 at r1 (raw file):

	// We only want to perform the cleanup if we are holding the meta1 lease.
	// This ensures only one server can perform the job at a time.

No idea if this is legit or not. I think it would be nice to get a review from whoever suggested this idea as well :)


pkg/sql/temporary_schema.go, line 483 at r1 (raw file):

	for sessionID := range sessionIDs {
		if _, ok := activeSessions[sessionID.Uint128]; !ok {
			log.Infof(ctx, "cleaning up temporary object for session %q", sessionID)

I'd recommend just getting rid of this one, as in the happy case of cleaning up, say 1000 sessions, we'll get 1000 pointless and unactionable log lines. I'd keep the "starting" and "finishing" ones, and the error ones, and add "cleaned up %d/%d" sessions to the "finishing" log.


pkg/sql/temporary_schema.go, line 484 at r1 (raw file):

		if _, ok := activeSessions[sessionID.Uint128]; !ok {
			log.Infof(ctx, "cleaning up temporary object for session %q", sessionID)
			c.metrics.schemasToDelete.Inc(1)

IF you never decrement it, wont this look kinda scary on the timeseries graphs? The name makes it sound like it's a queue. Maybe it should be a gauge.


pkg/sql/sqlbase/internal.go, line 24 at r1 (raw file):

	// ApplicationName represents the application that the query runs under.
	ApplicationName string
	// SearchPath represents the namespsaces to search in.

nit: typo in namespaces

@otan
Copy link
Contributor Author

otan commented Mar 6, 2020 via email

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 10 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @otan)


pkg/sql/temporary_schema.go, line 45 at r1 (raw file):

// defaultTempObjectCleanupInterval is the interval of time after which a background job
// runs to cleanup temporary tables that weren't deleted at session exit.
var defaultTempObjectCleanupInterval = 30 * time.Minute

smells like a reserved cluster setting.


pkg/sql/temporary_schema.go, line 334 at r1 (raw file):

	metrics                          *temporaryObjectCleanerMetrics

	sleepFunc func(time.Duration)

Is this ever used?


pkg/sql/temporary_schema.go, line 362 at r1 (raw file):

		isMeta1LeaseholderFunc:           isMeta1LeaseholderFunc,
		testingKnobs:                     testingKnobs,
		metrics: &temporaryObjectCleanerMetrics{

I don't think you've got these metrics hooked up. Once you do I think you'll be dismayed by the linter's strong-arming you to describe there metrics in https://github.com/cockroachdb/cockroach/blob/master/pkg/ts/catalog/chart_catalog.go.

In order to get it hooked up you need to:

  1. Implement https://godoc.org/github.com/cockroachdb/cockroach/pkg/util/metric#Struct on temporaryObjectCleanerMetrics
  2. Provide a method to get the metrics:
func (c *TemporaryObjectCleaner) Metrics() metric.Struct { return &c.metrics }
  1. export the fields in the metrics struct
  2. register the metric struct in the registry in server.go

nit: there's a pattern generally throughout the codebase to:

  • have the metrics struct as a non-pointer field in its parent
  • have a make...Metrics function that constructs the metrics struct
  • store the meta in a var block // I'm not so attached to this but if you look around, it's the pattern

pkg/sql/temporary_schema.go, line 363 at r1 (raw file):

		testingKnobs:                     testingKnobs,
		metrics: &temporaryObjectCleanerMetrics{
			jobRunning: metric.NewGauge(metric.Metadata{

This feels like a thing I'd want to keep as a counter that just increases monotonically. I can see being torn about it. If anything though I'd make the below one a guage before this one. We have good rate/delta computation

It's moments like these where I wish our timeseries database was either more sophisticated or didn't exist. Always incrementing counters which you can subtract from eachother and are known to be sampled at the same time are a really nice metric primitive. We lack this operation in our timeseries database. It's probably pretty easy to add. Anyway, prometheus offers it.

The nice thing about having counters is it also gives you rates. Not that this is a thing that's going to have any interesting pattern in its rate.


pkg/sql/temporary_schema.go, line 409 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

No idea if this is legit or not. I think it would be nice to get a review from whoever suggested this idea as well :)

This is legit.


pkg/sql/temporary_schema.go, line 411 at r1 (raw file):

	// This ensures only one server can perform the job at a time.
	var isLeaseholder bool
	if err := retryFunc(ctx, func() error {

did you find the retryFunc handy here for tests or something? In general I don't think you need to be retrying this. A lot of the errors will be due to things like shutting down. Leases don't more around very often. I think I'd rather increase the interval than increase the retries here.


pkg/sql/temporary_schema.go, line 483 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I'd recommend just getting rid of this one, as in the happy case of cleaning up, say 1000 sessions, we'll get 1000 pointless and unactionable log lines. I'd keep the "starting" and "finishing" ones, and the error ones, and add "cleaned up %d/%d" sessions to the "finishing" log.

VEventf is cool. Andrei has been pushing a testing methodology whereby you can examine what happens in the execution of calls by inspecting what got passed to their traces. I like it because it's less verbose than if log.V(...) { log.Info(...) } but I'm interest in seeing how sprinkling tracing throughout the code will come in handy later.


pkg/sql/temporary_schema.go, line 484 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

IF you never decrement it, wont this look kinda scary on the timeseries graphs? The name makes it sound like it's a queue. Maybe it should be a gauge.

This one does feels weird as a counter given what you've made a guage in terms of counting the job runs.


pkg/sql/temporary_schema.go, line 489 at r1 (raw file):

			// the given schema correctly.
			if err := retryFunc(ctx, func() error {
				return cleanupSessionTempObjects(

This code doesn't seem to log or record a metric in the error path. I get that you're retrying but in the face of some errors but any retries are interesting. Perhaps a metric that you pass to your retryFunc to count retries and a log.Every on the cleaner and then an if err != nil && c.logEvery.ShouldLog() { log.InfofDepth(ctx, 1, ...) }`

@otan otan force-pushed the temp_schema_cleanup_boogaloo branch from 15cbf16 to dceef05 Compare March 6, 2020 19:11
Copy link
Contributor Author

@otan otan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @jordanlewis)


pkg/sql/temporary_schema.go, line 45 at r1 (raw file):

Previously, ajwerner wrote…

smells like a reserved cluster setting.

Done.


pkg/sql/temporary_schema.go, line 334 at r1 (raw file):

Previously, ajwerner wrote…

Is this ever used?

Nope - relic from previous retry library. Fixed!


pkg/sql/temporary_schema.go, line 362 at r1 (raw file):

Previously, ajwerner wrote…

I don't think you've got these metrics hooked up. Once you do I think you'll be dismayed by the linter's strong-arming you to describe there metrics in https://github.com/cockroachdb/cockroach/blob/master/pkg/ts/catalog/chart_catalog.go.

In order to get it hooked up you need to:

  1. Implement https://godoc.org/github.com/cockroachdb/cockroach/pkg/util/metric#Struct on temporaryObjectCleanerMetrics
  2. Provide a method to get the metrics:
func (c *TemporaryObjectCleaner) Metrics() metric.Struct { return &c.metrics }
  1. export the fields in the metrics struct
  2. register the metric struct in the registry in server.go

nit: there's a pattern generally throughout the codebase to:

  • have the metrics struct as a non-pointer field in its parent
  • have a make...Metrics function that constructs the metrics struct
  • store the meta in a var block // I'm not so attached to this but if you look around, it's the pattern

Done.


pkg/sql/temporary_schema.go, line 363 at r1 (raw file):

Previously, ajwerner wrote…

This feels like a thing I'd want to keep as a counter that just increases monotonically. I can see being torn about it. If anything though I'd make the below one a guage before this one. We have good rate/delta computation

It's moments like these where I wish our timeseries database was either more sophisticated or didn't exist. Always incrementing counters which you can subtract from eachother and are known to be sampled at the same time are a really nice metric primitive. We lack this operation in our timeseries database. It's probably pretty easy to add. Anyway, prometheus offers it.

The nice thing about having counters is it also gives you rates. Not that this is a thing that's going to have any interesting pattern in its rate.

I wanted to keep this as a measure for when the cleaner is running, and when the cleaner is not running (i.e. 1=active, 0=not active) -- it's kind of a practice I've had from my old place which was nice to read.


pkg/sql/temporary_schema.go, line 409 at r1 (raw file):

Previously, ajwerner wrote…

This is legit.

Done.


pkg/sql/temporary_schema.go, line 411 at r1 (raw file):

Previously, ajwerner wrote…

did you find the retryFunc handy here for tests or something? In general I don't think you need to be retrying this. A lot of the errors will be due to things like shutting down. Leases don't more around very often. I think I'd rather increase the interval than increase the retries here.

Nope, just added retries to everything. I can remove this one. Done.


pkg/sql/temporary_schema.go, line 483 at r1 (raw file):

Previously, ajwerner wrote…

VEventf is cool. Andrei has been pushing a testing methodology whereby you can examine what happens in the execution of calls by inspecting what got passed to their traces. I like it because it's less verbose than if log.V(...) { log.Info(...) } but I'm interest in seeing how sprinkling tracing throughout the code will come in handy later.

Done.


pkg/sql/temporary_schema.go, line 484 at r1 (raw file):

Previously, ajwerner wrote…

This one does feels weird as a counter given what you've made a guage in terms of counting the job runs.

I was thinking this should actively count how many have been deleted. More accurately, I care about (delta)/time period so I can see things like "deletes per second". Is this not what counter is? Sorry, haven't used this metric system much and this is what I've interpreted it to be able to do.


pkg/sql/temporary_schema.go, line 489 at r1 (raw file):

Previously, ajwerner wrote…

This code doesn't seem to log or record a metric in the error path. I get that you're retrying but in the face of some errors but any retries are interesting. Perhaps a metric that you pass to your retryFunc to count retries and a log.Every on the cleaner and then an if err != nil && c.logEvery.ShouldLog() { log.InfofDepth(ctx, 1, ...) }`

I've decided to just log every error during a retry in the retryFunc.

@jordanlewis
Copy link
Member

I think there is something wrong going on. Seen in one of the test failure outputs:

I200306 19:56:58.975808 504746 sql/temporary_schema.go:448  [n1] running temporary object cleanup background job
I200306 19:56:58.976165 504746 sql/temporary_schema.go:482  [n1] found 0 temporary schemas
I200306 19:56:58.976944 504746 sql/temporary_schema.go:531  [n1] completed temporary object cleanup job
I200306 19:56:58.977016 504746 sql/temporary_schema.go:448  [n1] running temporary object cleanup background job
I200306 19:56:58.977571 504746 sql/temporary_schema.go:482  [n1] found 0 temporary schemas
I200306 19:56:58.978346 504746 sql/temporary_schema.go:531  [n1] completed temporary object cleanup job
I200306 19:56:58.978482 504746 sql/temporary_schema.go:448  [n1] running temporary object cleanup background job
I200306 19:56:58.979114 504746 sql/temporary_schema.go:482  [n1] found 0 temporary schemas
I200306 19:56:58.980437 504746 sql/temporary_schema.go:531  [n1] completed temporary object cleanup job
I200306 19:56:58.980556 504746 sql/temporary_schema.go:448  [n1] running temporary object cleanup background job
I200306 19:56:58.981144 504746 sql/temporary_schema.go:482  [n1] found 0 temporary schemas
I200306 19:56:58.982078 504746 sql/temporary_schema.go:531  [n1] completed temporary object cleanup job
I200306 19:56:58.982134 504746 sql/temporary_schema.go:448  [n1] running temporary object cleanup background job
I200306 19:56:58.982741 504746 sql/temporary_schema.go:482  [n1] found 0 temporary schemas
I200306 19:56:58.983904 504746 sql/temporary_schema.go:531  [n1] completed temporary object cleanup job
I200306 19:56:58.983971 504746 sql/temporary_schema.go:448  [n1] running temporary object cleanup background job

@otan otan force-pushed the temp_schema_cleanup_boogaloo branch 3 times, most recently from 7f28db7 to 04b1121 Compare March 7, 2020 07:12
@otan
Copy link
Contributor Author

otan commented Mar 7, 2020

woops, fixed!

@otan otan force-pushed the temp_schema_cleanup_boogaloo branch 2 times, most recently from e1091e8 to 83e4b29 Compare March 9, 2020 16:51
@otan
Copy link
Contributor Author

otan commented Mar 9, 2020

there was a data race error, i have PR #45896 for it for now.

@otan otan force-pushed the temp_schema_cleanup_boogaloo branch 3 times, most recently from 9b9caad to 7587b6e Compare March 9, 2020 18:29
@otan
Copy link
Contributor Author

otan commented Mar 9, 2020

the data race error seems blocking... i will wait for that to resolve.

This PR introduces a `TempObjectCleaner`, which is a background task
that runs periodically on every node, but will only be executed by the
node that has the meta1 leaseholder. This task will cleanup dangling
temporary schemas that could not have been cleaned up by the connection.
The job also comes with logging and metrics. A simple test mocking the
ticking mechanism has also been included.

Also refactored the existing temporary schema deletion to use ExecEx,
which overrides the SearchPath / User executing the deletion at a lower
layer rather than using `SetSessionExecutor`.

Release note (sql change): Introduced a temporary table cleanup job that
runs once every periodically per cluster that will remove any temporary
schemas and their related objects that did not get removed cleanly when
the connection closed. This period can be changed by the cluster setting
`sql.temp_object_cleaner.cleanup_interval`, which defaults to 30min-ly.
@otan otan force-pushed the temp_schema_cleanup_boogaloo branch from 7587b6e to 34b4740 Compare March 9, 2020 20:37
@otan otan merged commit c28acc2 into cockroachdb:master Mar 9, 2020
@otan otan deleted the temp_schema_cleanup_boogaloo branch March 9, 2020 21:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cleanup loose hanging temporary tables
4 participants