sql: implement a background task for temp object cleanup #45669
Conversation
Force-pushed from d6a8b20 to 84b98c2, then from 8c5e449 to 15cbf16.
@jordanlewis RFAL if you have time :)
Mostly though having a stamp from someone who proposed the meta1 thing would be good!
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @otan)
pkg/sql/temporary_schema.go, line 409 at r1 (raw file):
// We only want to perform the cleanup if we are holding the meta1 lease.
// This ensures only one server can perform the job at a time.
No idea if this is legit or not. I think it would be nice to get a review from whoever suggested this idea as well :)
pkg/sql/temporary_schema.go, line 483 at r1 (raw file):
for sessionID := range sessionIDs {
	if _, ok := activeSessions[sessionID.Uint128]; !ok {
		log.Infof(ctx, "cleaning up temporary object for session %q", sessionID)
I'd recommend just getting rid of this one, as in the happy case of cleaning up, say 1000 sessions, we'll get 1000 pointless and unactionable log lines. I'd keep the "starting" and "finishing" ones, and the error ones, and add "cleaned up %d/%d" sessions to the "finishing" log.
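A minimal sketch of the suggested logging shape: count per-session outcomes inside the loop and emit a single summary line at the end. The function and message wording here are hypothetical, not taken from the PR:

```go
package main

import (
	"fmt"
	"log"
)

// cleanupSummary formats the single "finishing" log line suggested above.
// The %d/%d shape (cleaned vs. total candidate sessions) is an assumption.
func cleanupSummary(cleaned, total int) string {
	return fmt.Sprintf("completed temporary object cleanup: cleaned up %d/%d sessions", cleaned, total)
}

func main() {
	total, cleaned := 0, 0
	// Stand-in for the real session loop: count instead of logging per session.
	for _, active := range []bool{false, true, false} {
		total++
		if !active {
			cleaned++ // the actual cleanup would happen here
		}
	}
	log.Print(cleanupSummary(cleaned, total))
}
```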
pkg/sql/temporary_schema.go, line 484 at r1 (raw file):
if _, ok := activeSessions[sessionID.Uint128]; !ok {
	log.Infof(ctx, "cleaning up temporary object for session %q", sessionID)
	c.metrics.schemasToDelete.Inc(1)
If you never decrement it, won't this look kinda scary on the timeseries graphs? The name makes it sound like it's a queue. Maybe it should be a gauge.
pkg/sql/sqlbase/internal.go, line 24 at r1 (raw file):
// ApplicationName represents the application that the query runs under.
ApplicationName string
// SearchPath represents the namespsaces to search in.
nit: typo in namespaces
cc, @ajwerner mind taking a look at the meta1 usage?
Reviewed 4 of 10 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @otan)
pkg/sql/temporary_schema.go, line 45 at r1 (raw file):
// defaultTempObjectCleanupInterval is the interval of time after which a background job
// runs to cleanup temporary tables that weren't deleted at session exit.
var defaultTempObjectCleanupInterval = 30 * time.Minute
smells like a reserved cluster setting.
pkg/sql/temporary_schema.go, line 334 at r1 (raw file):
metrics   *temporaryObjectCleanerMetrics
sleepFunc func(time.Duration)
Is this ever used?
pkg/sql/temporary_schema.go, line 362 at r1 (raw file):
isMeta1LeaseholderFunc: isMeta1LeaseholderFunc,
testingKnobs:           testingKnobs,
metrics: &temporaryObjectCleanerMetrics{
I don't think you've got these metrics hooked up. Once you do, I think you'll be dismayed by the linter's strong-arming you to describe these metrics in https://github.com/cockroachdb/cockroach/blob/master/pkg/ts/catalog/chart_catalog.go.
In order to get it hooked up you need to:
- Implement https://godoc.org/github.com/cockroachdb/cockroach/pkg/util/metric#Struct on
temporaryObjectCleanerMetrics
- Provide a method to get the metrics:
func (c *TemporaryObjectCleaner) Metrics() metric.Struct { return &c.metrics }
- export the fields in the metrics struct
- register the metric struct in the registry in server.go
nit: there's a pattern generally throughout the codebase to:
- have the metrics struct as a non-pointer field in its parent
- have a make...Metrics function that constructs the metrics struct
- store the meta in a var block
I'm not so attached to this but if you look around, it's the pattern.
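A rough sketch of the hookup pattern described in this comment. The Struct interface below is a local stand-in mirroring the shape of cockroach's metric.Struct marker interface, and all type names are illustrative, not the PR's actual code:

```go
package main

import "fmt"

// Struct mirrors the shape of cockroach's metric.Struct marker interface;
// this local copy keeps the sketch self-contained.
type Struct interface {
	MetricStruct()
}

// Gauge is a toy stand-in for metric.Gauge.
type Gauge struct{ v int64 }

// TemporaryObjectCleanerMetrics follows the pattern above: exported fields
// so the registry can find them, constructed by a make...Metrics helper.
type TemporaryObjectCleanerMetrics struct {
	ActiveCleaners Gauge
}

// MetricStruct marks the struct so a registry can traverse it by reflection.
func (m *TemporaryObjectCleanerMetrics) MetricStruct() {}

// TemporaryObjectCleaner holds its metrics as a non-pointer field.
type TemporaryObjectCleaner struct {
	metrics TemporaryObjectCleanerMetrics
}

func makeTemporaryObjectCleanerMetrics() TemporaryObjectCleanerMetrics {
	return TemporaryObjectCleanerMetrics{}
}

// Metrics exposes the struct for registration (e.g. in server.go).
func (c *TemporaryObjectCleaner) Metrics() Struct { return &c.metrics }

func main() {
	c := TemporaryObjectCleaner{metrics: makeTemporaryObjectCleanerMetrics()}
	fmt.Printf("%T\n", c.Metrics())
}
```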
pkg/sql/temporary_schema.go, line 363 at r1 (raw file):
testingKnobs: testingKnobs,
metrics: &temporaryObjectCleanerMetrics{
	jobRunning: metric.NewGauge(metric.Metadata{
This feels like a thing I'd want to keep as a counter that just increases monotonically. I can see being torn about it. If anything, though, I'd make the below one a gauge before this one. We have good rate/delta computation.
It's moments like these where I wish our timeseries database was either more sophisticated or didn't exist. Always-incrementing counters which you can subtract from each other and which are known to be sampled at the same time are a really nice metric primitive. We lack this operation in our timeseries database. It's probably pretty easy to add. Anyway, prometheus offers it.
The nice thing about having counters is it also gives you rates. Not that this is a thing that's going to have any interesting pattern in its rate.
pkg/sql/temporary_schema.go, line 409 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
No idea if this is legit or not. I think it would be nice to get a review from whoever suggested this idea as well :)
This is legit.
pkg/sql/temporary_schema.go, line 411 at r1 (raw file):
// This ensures only one server can perform the job at a time.
var isLeaseholder bool
if err := retryFunc(ctx, func() error {
Did you find the retryFunc handy here for tests or something? In general I don't think you need to be retrying this. A lot of the errors will be due to things like shutting down. Leases don't move around very often. I think I'd rather increase the interval than increase the retries here.
pkg/sql/temporary_schema.go, line 483 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
I'd recommend just getting rid of this one, as in the happy case of cleaning up, say 1000 sessions, we'll get 1000 pointless and unactionable log lines. I'd keep the "starting" and "finishing" ones, and the error ones, and add "cleaned up %d/%d" sessions to the "finishing" log.
VEventf is cool. Andrei has been pushing a testing methodology whereby you can examine what happens in the execution of calls by inspecting what got passed to their traces. I like it because it's less verbose than if log.V(...) { log.Info(...) }, and I'm interested in seeing how sprinkling tracing throughout the code will come in handy later.
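A toy analog of the VEventf idea, assuming the semantics of "always record to the trace, only emit to the log when verbose enough". The real log.VEventf works against a trace span carried in the context; this stand-in uses a plain slice so the recorded events can be inspected, as in the testing methodology described above:

```go
package main

import "fmt"

// verbosity emulates the log.V threshold; in a real system it would come
// from configuration rather than a package variable.
var verbosity = 0

// events collects what would go to the trace span, so tests can inspect it.
var events []string

// vEventf is a toy analog of log.VEventf: always record the event to the
// trace, and only emit it to the log when verbosity is high enough.
func vEventf(level int, format string, args ...interface{}) {
	msg := fmt.Sprintf(format, args...)
	events = append(events, msg)
	if verbosity >= level {
		fmt.Println(msg)
	}
}

func main() {
	// With verbosity 0 nothing is printed, but the trace still sees the event.
	vEventf(2, "cleaning up temporary object for session %q", "1234")
	fmt.Println(len(events))
}
```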
pkg/sql/temporary_schema.go, line 484 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
IF you never decrement it, wont this look kinda scary on the timeseries graphs? The name makes it sound like it's a queue. Maybe it should be a gauge.
This one does feels weird as a counter given what you've made a guage in terms of counting the job runs.
pkg/sql/temporary_schema.go, line 489 at r1 (raw file):
// the given schema correctly.
if err := retryFunc(ctx, func() error {
	return cleanupSessionTempObjects(
This code doesn't seem to log or record a metric in the error path. I get that you're retrying in the face of some errors, but any retries are interesting. Perhaps a metric that you pass to your retryFunc to count retries, plus a log.Every on the cleaner, and then an if err != nil && c.logEvery.ShouldLog() { log.InfofDepth(ctx, 1, ...) }.
Force-pushed from 15cbf16 to dceef05.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @jordanlewis)
pkg/sql/temporary_schema.go, line 45 at r1 (raw file):
Previously, ajwerner wrote…
smells like a reserved cluster setting.
Done.
pkg/sql/temporary_schema.go, line 334 at r1 (raw file):
Previously, ajwerner wrote…
Is this ever used?
Nope - relic from previous retry library. Fixed!
pkg/sql/temporary_schema.go, line 362 at r1 (raw file):
Previously, ajwerner wrote…
I don't think you've got these metrics hooked up. Once you do, I think you'll be dismayed by the linter's strong-arming you to describe these metrics in https://github.com/cockroachdb/cockroach/blob/master/pkg/ts/catalog/chart_catalog.go.
In order to get it hooked up you need to:
- Implement https://godoc.org/github.com/cockroachdb/cockroach/pkg/util/metric#Struct on temporaryObjectCleanerMetrics
- Provide a method to get the metrics: func (c *TemporaryObjectCleaner) Metrics() metric.Struct { return &c.metrics }
- export the fields in the metrics struct
- register the metric struct in the registry in server.go
nit: there's a pattern generally throughout the codebase to:
- have the metrics struct as a non-pointer field in its parent
- have a make...Metrics function that constructs the metrics struct
- store the meta in a var block
I'm not so attached to this but if you look around, it's the pattern.
Done.
pkg/sql/temporary_schema.go, line 363 at r1 (raw file):
Previously, ajwerner wrote…
This feels like a thing I'd want to keep as a counter that just increases monotonically. I can see being torn about it. If anything, though, I'd make the below one a gauge before this one. We have good rate/delta computation.
It's moments like these where I wish our timeseries database was either more sophisticated or didn't exist. Always-incrementing counters which you can subtract from each other and which are known to be sampled at the same time are a really nice metric primitive. We lack this operation in our timeseries database. It's probably pretty easy to add. Anyway, prometheus offers it.
The nice thing about having counters is it also gives you rates. Not that this is a thing that's going to have any interesting pattern in its rate.
I wanted to keep this as a measure of when the cleaner is running and when it is not (i.e. 1=active, 0=not active). It's a practice I picked up at my old place that was nice to read.
pkg/sql/temporary_schema.go, line 409 at r1 (raw file):
Previously, ajwerner wrote…
This is legit.
Done.
pkg/sql/temporary_schema.go, line 411 at r1 (raw file):
Previously, ajwerner wrote…
Did you find the retryFunc handy here for tests or something? In general I don't think you need to be retrying this. A lot of the errors will be due to things like shutting down. Leases don't move around very often. I think I'd rather increase the interval than increase the retries here.
Nope, just added retries to everything. I can remove this one. Done.
pkg/sql/temporary_schema.go, line 483 at r1 (raw file):
Previously, ajwerner wrote…
VEventf is cool. Andrei has been pushing a testing methodology whereby you can examine what happens in the execution of calls by inspecting what got passed to their traces. I like it because it's less verbose than if log.V(...) { log.Info(...) }, and I'm interested in seeing how sprinkling tracing throughout the code will come in handy later.
Done.
pkg/sql/temporary_schema.go, line 484 at r1 (raw file):
Previously, ajwerner wrote…
This one does feel weird as a counter given what you've made a gauge in terms of counting the job runs.
I was thinking this should actively count how many have been deleted. More accurately, I care about (delta)/time period so I can see things like "deletes per second". Is this not what a counter is? Sorry, I haven't used this metric system much and this is what I've interpreted it to be able to do.
pkg/sql/temporary_schema.go, line 489 at r1 (raw file):
Previously, ajwerner wrote…
This code doesn't seem to log or record a metric in the error path. I get that you're retrying in the face of some errors, but any retries are interesting. Perhaps a metric that you pass to your retryFunc to count retries, plus a log.Every on the cleaner, and then an if err != nil && c.logEvery.ShouldLog() { log.InfofDepth(ctx, 1, ...) }.
I've decided to just log every error during a retry in the retryFunc.
I think there is something wrong going on. Seen in one of the test failure outputs:
Force-pushed from 7f28db7 to 04b1121.
Whoops, fixed!
Force-pushed from e1091e8 to 83e4b29.
There was a data race error; I have PR #45896 for it for now.
Force-pushed from 9b9caad to 7587b6e.
The data race error seems blocking... I will wait for that to resolve.
This PR introduces a TempObjectCleaner, which is a background task that runs periodically on every node but will only be executed by the node that holds the meta1 leaseholder. This task will clean up dangling temporary schemas that could not be cleaned up by the connection. The job also comes with logging and metrics. A simple test mocking the ticking mechanism has also been included.

Also refactored the existing temporary schema deletion to use ExecEx, which overrides the SearchPath / User executing the deletion at a lower layer rather than using SetSessionExecutor.

Release note (sql change): Introduced a temporary table cleanup job that runs periodically per cluster and removes any temporary schemas and their related objects that did not get removed cleanly when the connection closed. This period can be changed by the cluster setting sql.temp_object_cleaner.cleanup_interval, which defaults to 30 minutes.
Force-pushed from 7587b6e to 34b4740.
Resolves #45666