tablemetadatacache: add cluster settings to automate job execution #130198
Conversation
deef37e to 6b47378
Looks good! A couple of comments and questions.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @kyle-a-wong, and @xinhaoz)
pkg/sql/tablemetadatacache/cluster_settings.go
line 23 at r1 (raw file):
```go
var tableMetadataCacheAutoUpdatesEnabled = settings.RegisterBoolSetting(
	settings.ApplicationLevel,
	"tablemetadatacache.automatic_updates.enabled",
```
Might be a bit late for this, but I think we can just call this table metadata instead of table metadata cache. For system-related statistics (table size, num ranges, etc.) I feel that there's an assumption of precomputation and staleness of the metadata (links to the related MySQL and Postgres settings below):
information_schema_stats_expiry
autovacuum_naptime
pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go
line 31 at r1 (raw file):
```go
// updateJobExecFn specifies the function that is run on each iteration of the
// table metadata update job. It can be overridden in tests.
var updateJobExecFn func(context.Context, isql.Executor) error = updateTableMetadataCache
```
I recommend making this a property on the tableMetadataUpdateJobResumer struct, so that there are fewer package globals.
There's (light) precedent for finding a resumer in tests:
If possible, I'd recommend finding the resumer, doing a type assertion, and then injecting the mock testing function into the test. The jobs.Registry struct has a function `GetResumerForClaimedJob` which could help get a direct handle to the resumer in question. Not sure if this comment is totally clear, lmk if it makes sense.
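For illustration, a rough sketch of that pattern, assuming an in-package test, the `GetResumerForClaimedJob` function mentioned above, and the `updateFn` field name sketched later in this thread (the `registry`/`jobID` variables are illustrative):

```go
// Hypothetical test snippet: fetch the resumer for the claimed job, assert its
// concrete type, and inject a mock update function before letting it loop.
resumer, err := registry.GetResumerForClaimedJob(jobID)
require.NoError(t, err)
r, ok := resumer.(*tableMetadataUpdateJobResumer)
require.True(t, ok)
r.updateFn = func(ctx context.Context, ie isql.Executor) error {
	return nil // mock: skip the real system.table_metadata refresh
}
```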
pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go
line 36 at r1 (raw file):
```go
// table metadata update job. It is not thread-safe and should only be used in
// tests prior to starting the cluster.
func MockJobExecFn(fn func(context.Context, isql.Executor) error) {
```
Will this be used outside of tests? If not, I recommend moving it to the test package. There it will show up in neither the package namespace nor the final cockroach binary.
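If it does move, one option (a sketch only; the helper name here is hypothetical) is an in-package _test.go file, so nothing new needs to be exported:

```go
// update_table_metadata_cache_job_helpers_test.go
package tablemetadatacache

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/sql/isql"
)

// testingSetUpdateJobExecFn swaps the job's update function and returns a
// closure that restores the original. Compiled only into test binaries.
func testingSetUpdateJobExecFn(fn func(context.Context, isql.Executor) error) func() {
	old := updateJobExecFn
	updateJobExecFn = fn
	return func() { updateJobExecFn = old }
}
```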
pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go
line 88 at r1 (raw file):
This call to timer.Stop can be safely removed. Note that unlike the standard library's Timer type, this Timer will not begin counting down until Reset is called for the first time.
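For reference, the usual shape of that loop with this Timer type, based on the code quoted below (a sketch, not the PR's exact implementation; the settings-change channel is omitted here):

```go
var timer timeutil.Timer
defer timer.Stop()
for {
	if tableMetadataCacheAutoUpdatesEnabled.Get(&execCtx.ExecCfg().Settings.SV) {
		timer.Reset(tableMetadataCacheValidDuration.Get(&execCtx.ExecCfg().Settings.SV))
	}
	select {
	case <-timer.C:
		timer.Read = true
		// Run the table metadata update.
	case <-ctx.Done():
		return ctx.Err()
	}
}
```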
pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go
line 92 at r1 (raw file):
```go
for {
	if tableMetadataCacheAutoUpdatesEnabled.Get(&execCtx.ExecCfg().Settings.SV) {
		timer.Reset(tableMetadataCacheValidDuration.Get(&execCtx.ExecCfg().Settings.SV))
```
One question is what happens if the user enables the automatic updates and sets the duration to zero? If it causes the server to spin, it might be worth guarding against that use case.
6b47378 to 77d4c69
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @angles-n-daemons, @dhartunian, and @kyle-a-wong)
pkg/sql/tablemetadatacache/cluster_settings.go
line 23 at r1 (raw file):
Previously, angles-n-daemons (Brian Dillmann) wrote…
might be a bit late for this, but I think we can just call this table metadata instead of table metadata cache. for system related statistics (table size, num ranges, etc) I feel that there's an assumption on precomputation and staleness of the metadata (links for MySQL and Postgres related settings below)
Renamed.
pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go
line 31 at r1 (raw file):
Previously, angles-n-daemons (Brian Dillmann) wrote…
I recommend making this a property on the tableMetadataUpdateJobResumer struct, so that there's less package globals.
There's (light) precedence for finding a resumer in tests:
If possible, I'd recommend finding the resumer, doing a type assertion and then injecting the mock testing function into the test. The jobs.Registry struct has a function `GetResumerForClaimedJob` which could help get a direct handle to the resumer in question. Not sure if this comment is totally clear, lmk if it makes sense
I'm a little confused about the examples in the links - aren't they just looking for the job id in the jobs table to see if it's running? For this test I don't think we want to mock the entire resumer, just the update logic portion (i.e. what occurs in every loop). There's an alternative way to inject this with testing knobs, but it seemed a little overkill to introduce another set of testing knobs to the system (which are also evaluated during runtime) when we could just mock the update logic upfront - what are your thoughts here? I'll dig around a little to see if there are other methods of mocking job executions more granularly. Open to more suggestions here as well.
pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go
line 36 at r1 (raw file):
Previously, angles-n-daemons (Brian Dillmann) wrote…
Will this be used outside of tests? If not, I recommend moving it to the test package. There it will neither show up in the package namespace, nor the final cockroach binary
No, it'll only be used for testing. We'll need to export the `updateJobExecFn` var if we move this fn to the testing pkg, which we'd then just be able to set directly in the test.
pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go
line 88 at r1 (raw file):
Previously, angles-n-daemons (Brian Dillmann) wrote…
this call to timer.Stop can be safely removed. Note that unlike the standard library's Timer type, this Timer will not begin counting down until Reset is called for the first time
Nice, done.
pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go
line 92 at r1 (raw file):
Previously, angles-n-daemons (Brian Dillmann) wrote…
One question is what happens if the user enables the automatic updates and sets the duration to zero? If it causes the server to spin, it might be worth guarding against that use case.
Good call - let's enforce a conservative minimum of 1 minute for now. Added as a validator, lmk if you think the min should be increased.
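Roughly what that validator amounts to (a sketch; the function name is illustrative, and how it is wired into the setting registration follows the settings package's duration-validator option):

```go
// Enforce a conservative floor so a tiny interval cannot make the job spin.
func validateAutomaticUpdateInterval(d time.Duration) error {
	const minInterval = time.Minute
	if d < minInterval {
		return fmt.Errorf("interval must be at least %s, got %s", minInterval, d)
	}
	return nil
}
```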
cdeec75 to fe20718
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @dhartunian, @kyle-a-wong, and @xinhaoz)
pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go
line 31 at r1 (raw file):
Previously, xinhaoz (Xin Hao Zhang) wrote…
I'm a little confused about the examples in the links - aren't they just looking for the job id in the jobs table to see if it's running? For this test I don't think we want to mock the entire resumer, just the update logic portion (i.e. what occurs in every loop). There's an alternative way to inject this with testing knobs but it seemed a little overkill to introduce another set of testing knobs to the system, which are also evaluated during runtime, when we could just mock the update logic upfront - what are your thoughts here? I'll dig around a little to see if there are other methods of mocking job executions more granularly. Open to more suggestions here as well.
Ah, I think where I started was kind of confusing. Yeah I was thinking something along the lines of what you described by the testing knobs. Something to the effect of:
```go
type tableMetadataUpdateJobResumer struct {
	job      *jobs.Job
	updateFn func(context.Context, isql.Executor) error
}
```
You could use the jobId to get a handle to the resumer (so you could patch the updateFn). Thinking about this more though, it seems like the timing would make it tricky. Consider this comment resolved
pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go
line 92 at r1 (raw file):
Previously, xinhaoz (Xin Hao Zhang) wrote…
Good call - let's enforce a conservative minimum of 1 minute for now. Added as a validator, lmk if you think the min should be increased.
That sounds perfect to me
Reviewed 1 of 2 files at r2, 3 of 4 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @angles-n-daemons, @kyle-a-wong, and @xinhaoz)
-- commits
line 4 at r3:
nit: `UPDATE`
-- commits
line 19 at r3:
nit: `control`.
pkg/sql/tablemetadatacache/cluster_settings.go
line 23 at r1 (raw file):
Previously, xinhaoz (Xin Hao Zhang) wrote…
Renamed.
Relatedly, can we reconsider these cluster setting names? I think creating a namespace for this stuff would be nice; maybe `obs.` as a prefix, or even putting them under `sql.stats.` might be better. Having `tablemetadata` at the top level feels messy.
pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go
line 88 at r1 (raw file):
Previously, xinhaoz (Xin Hao Zhang) wrote…
Nice, done.
I don't see this change in the latest diff.
pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go
line 107 at r1 (raw file):
```go
// Run table metadata update job.
log.Infof(ctx, "running table metadata update job")
```
take it or leave it, but could we log which signal told the job to run?
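One possible shape for that (a sketch only; the channel and wording here mirror the code quoted elsewhere in this review and are not the PR's exact implementation):

```go
var signal string
select {
case <-timer.C:
	timer.Read = true
	signal = "data cache expiration"
case <-scheduleSettingsCh:
	signal = "settings change"
}
log.Infof(ctx, "running table metadata update job, signaled by: %s", signal)
```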
pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go
line 74 at r3 (raw file):
```go
// Register callbacks to signal the job to reset the timer when timer related settings change.
scheduleSettingsCh := make(chan struct{})
tableMetadataCacheAutoUpdatesEnabled.SetOnChange(&execCtx.ExecCfg().Settings.SV, func(_ context.Context) {
```
Is it possible for `Resume` to be run twice on a node even though this is a "forever" job? That would register the callbacks multiple times. I guess that might be fine since `scheduleSettingsCh` would be different.
fe20718 to 56b580b
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @angles-n-daemons, @dhartunian, and @kyle-a-wong)
Previously, dhartunian (David Hartunian) wrote…
nit: `UPDATE`
Done.
Previously, dhartunian (David Hartunian) wrote…
nit: `control`.
Done.
pkg/sql/tablemetadatacache/cluster_settings.go
line 23 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
Relatedly, can we reconsider these cluster setting names? I think creating a namespace for this stuff would be nice; maybe `obs.` as a prefix, or even putting them under `sql.stats.` might be better. Having `tablemetadata` at the top level feels messy.
SGTM, added `obs` prefix for now!
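For concreteness, the renamed registration would look something like this (the key is the one that appears in the test discussed later in this review; the description string is illustrative):

```go
var tableMetadataCacheAutoUpdatesEnabled = settings.RegisterBoolSetting(
	settings.ApplicationLevel,
	"obs.tablemetadata.automatic_updates.enabled",
	"enables automatic updates of the table metadata cache",
	false,
)
```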
pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go
line 31 at r1 (raw file):
Previously, angles-n-daemons (Brian Dillmann) wrote…
Ah, I think where I started was kind of confusing. Yeah I was thinking something along the lines of what you described by the testing knobs. Something to the effect of:
```go
type tableMetadataUpdateJobResumer struct {
	job      *jobs.Job
	updateFn func(context.Context, isql.Executor) error
}
```
You could use the jobId to get a handle to the resumer (so you could patch the updateFn). Thinking about this more though, it seems like the timing would make it tricky. Consider this comment resolved
Ahh gotcha. I'll leave this for now and we can introduce it as a testing knob later if that turns out to be better suited.
pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go
line 88 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
I don't see this change in the latest diff.
Done for real.
pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go
line 107 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
take it or leave it, but could we log which signal told the job to run?
Done.
pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go
line 74 at r3 (raw file):
Previously, dhartunian (David Hartunian) wrote…
is it possible for `Resume` to be run twice on a node even though this is a "forever" job? that would register the callbacks multiple times. I guess that might be fine since `scheduleSettingsCh` would be different.
Yeah I was thinking about this as well. I'm not sure under what circumstances, or if it's even possible, that a job may be abandoned by a node and then picked up again by the same node. In the end I thought it was alright since, as you mentioned, the `scheduleSettingsCh` would be different. I also found that this was being done in the span config reconciliation forever job:
cockroach/pkg/spanconfig/spanconfigjob/job.go, line 130 (at e9a6d3d):
```go
reconciliationJobCheckpointInterval.SetOnChange(settingValues, func(ctx context.Context) {
```
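The pattern from that job, roughly (a sketch: a buffered channel plus a non-blocking send, created fresh on each `Resume`, so a stale channel from an earlier registration is simply never read):

```go
scheduleSettingsCh := make(chan struct{}, 1)
tableMetadataCacheAutoUpdatesEnabled.SetOnChange(&execCtx.ExecCfg().Settings.SV,
	func(ctx context.Context) {
		select {
		case scheduleSettingsCh <- struct{}{}:
		default:
		}
	})
```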
ff90052 to 73324a7
9a32b93 to 8fc9750
076774e to e53db62
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @angles-n-daemons, @dhartunian, and @xinhaoz)
pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go
line 36 at r1 (raw file):
Previously, xinhaoz (Xin Hao Zhang) wrote…
No, it'll only be used for testing. We'll need to export the `updateJobExecFn` var if we move this fn to the testing pkg, which we'd then just be able to set directly in the test.
I think instead of us defining this custom mocking function, we should use `testutils.TestingHook`. `updateJobExecFn` would have to be updated to be public, but then you can do the following in your test below:
```go
defer testutils.TestingHook(&tablemetadatacache.UpdateJobExecFn,
	func(ctx context.Context, ie isql.Executor) error {
		// mock implementation
		return nil
	})()
```
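For context, testutils.TestingHook swaps a package-level variable for the test's lifetime and returns a restore closure, hence the trailing `()` on the deferred call above. A minimal sketch of that shape (not the actual pkg/testutils implementation):

```go
// Set *ptr to val and return a closure that restores the original value;
// intended usage: defer testingHookSketch(&SomeGlobal, mock)().
func testingHookSketch[T any](ptr *T, val T) func() {
	orig := *ptr
	*ptr = val
	return func() { *ptr = orig }
}
```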
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @angles-n-daemons, @dhartunian, and @xinhaoz)
pkg/sql/tablemetadatacache/update_table_metadata_cache_job_test.go
line 177 at r5 (raw file):
```go
conn.Exec(t, `SET CLUSTER SETTING obs.tablemetadata.automatic_updates.enabled = f`)
initialCount := getMockCallCount()
err := waitForJobRuns(1, 200*time.Millisecond)
```
Do you think you should also set the `TableMetadataCacheValidDuration` value to be relatively low to ensure that there are no automatic updates? If a bug is introduced where automatic_updates isn't respected but the cache valid duration is, it could be the case that the default time (20m) hasn't elapsed but it is still doing automatic updates.
pkg/sql/tablemetadatacache/update_table_metadata_cache_job_test.go
line 187 at r5 (raw file):
```go
for i := 1; i < len(mockCalls); i++ {
	timeBetweenCalls := mockCalls[i].Sub(mockCalls[i-1])
	require.GreaterOrEqual(t, timeBetweenCalls, 50*time.Millisecond,
```
I'm not sure if it makes sense for these assertions to be in a separate t.Run(). It isn't inherently obvious that if this subtest fails, it's actually failing due to code running in "AutomaticUpdatesEnabled"
e53db62 to c79de01
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @angles-n-daemons and @dhartunian)
pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go
line 36 at r1 (raw file):
Previously, kyle-a-wong (Kyle Wong) wrote…
I think instead of us defining this custom mocking function, we should use `testutils.TestingHook`. `updateJobExecFn` would have to be updated to be public, but then you can do the following in your test below:
```go
defer testutils.TestingHook(&tablemetadatacache.UpdateJobExecFn,
	func(ctx context.Context, ie isql.Executor) error {
		// mock implementation
		return nil
	})()
```
Ooh, nice find - done. Let's just get rid of this separate testing package for now; I don't think it was necessary to split it out, and removing it gets rid of some unnecessary pkg exports.
c79de01 to 254e150
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @angles-n-daemons, @dhartunian, and @kyle-a-wong)
pkg/sql/tablemetadatacache/update_table_metadata_cache_job_test.go
line 177 at r5 (raw file):
Previously, kyle-a-wong (Kyle Wong) wrote…
Do you think you should also set the `TableMetadataCacheValidDuration` value to be relatively low to ensure that there are no automatic updates? If a bug is introduced where automatic_updates isn't respected but the cache valid duration is, it could be the case that the default time (20m) hasn't elapsed but it is still doing automatic updates.
The validity duration should still have the previously set short value of 50ms. Copied it over to this subtest too in case code changes over there.
pkg/sql/tablemetadatacache/update_table_metadata_cache_job_test.go
line 187 at r5 (raw file):
Previously, kyle-a-wong (Kyle Wong) wrote…
I'm not sure if it makes sense for these assertions to be in a separate t.Run(). It isn't inherently obvious that if this subtest fails, it's actually failing due to code running in "AutomaticUpdatesEnabled".
Merged.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @angles-n-daemons, @dhartunian, and @xinhaoz)
pkg/sql/tablemetadatacache/update_table_metadata_cache_job_test.go
line 177 at r5 (raw file):
Previously, xinhaoz (Xin Hao Zhang) wrote…
The validity duration should still have the previously set short value of 50ms. Copied it over to this subtest too in case code changes over there.
Cool, thanks! One thing I forgot to mention was that if you run this subtest individually, it wouldn't have the previous subtest's settings, resulting in the situation I mentioned above.
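i.e., something like this at the top of the subtest, so it holds regardless of which subtests ran before (a sketch; the setting key and variable names are the ones used elsewhere in this review, while the `s`/`conn` test-server variables are illustrative):

```go
// Pin both settings the subtest depends on instead of inheriting them.
tablemetadatacache.TableMetadataCacheValidDuration.Override(
	ctx, &s.ClusterSettings().SV, 50*time.Millisecond)
conn.Exec(t, `SET CLUSTER SETTING obs.tablemetadata.automatic_updates.enabled = f`)
```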
TFTR!
This commit adds the ability to automate running the `UPDATE TABLE METADATA` job. Two new cluster settings are introduced to control the interval the job is run at:
- `obs.tablemetadatacache.data_valid_duration`: non-negative duration setting, which specifies the duration for which the data in system.table_metadata is considered valid. This duration will be used as a threshold to inform the system on when we should run the job to refresh data. Default 20m.
- `obs.tablemetadatacache.automatic_updates.enabled`: boolean setting, which defaults to false. If true, the system will execute the update job according to the interval setting above.

Fixes: #130098
Epic: CRDB-37558

Release note (ops change): New cluster settings have been added which control the refresh behaviour for the cached data shown in the db pages in db console.
- `obs.tablemetadatacache.data_valid_duration`: non-negative duration setting, which specifies the duration for which the data in system.table_metadata is considered valid. If the cache's last update time exceeds this threshold, users visiting the db pages will trigger a cache refresh. Default 20m.
- `obs.tablemetadatacache.automatic_updates.enabled`: enables the cache to be automatically updated according to the validity interval. Default false.