tablemetadatacache: add cluster settings to automate job execution #130198

xinhaoz · 2024-09-05T22:28:07Z

This commit adds the ability to automate running the UPDATE TABLE METADATA job. Two new cluster settings are introduced to control
the interval the job is run at:

obs.tablemetadatacache.data_valid_duration non-negative duration
setting, which specifies the duration for which the data in
system.table_metadata is considered valid. This duration will be used as
a threshold to inform the system on when we should run the job to refresh
data. Default 20m.
obs.tablemetadatacache.automatic_updates.enabled boolean setting,
which defaults to false. If true, the system will execute the update job
on according to the interval setting above.

Fixes: #130098
Epic: CRDB-37558

Release note (ops change): New cluster settings have been added
which control the refresh behaviour for the cached data shown
in the db pages in db console.

obs.tablemetadatacache.data_valid_duration non-negative duration
setting, which specifies the duration for which the data in
system.table_metadata is considered valid. If the cache's last update
time exceeds this threshold, users visiting the db pages will trigger
a cache refresh. Default 20m.
obs.tablemetadatacache.automatic_updates.enabled enables the cache
to be automatically updated according the validity interval. Default false.

cockroach-teamcity · 2024-09-05T22:28:18Z

This change is

angles-n-daemons

looks good! couple of comments and questions

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @kyle-a-wong, and @xinhaoz)

pkg/sql/tablemetadatacache/cluster_settings.go line 23 at r1 (raw file):

var tableMetadataCacheAutoUpdatesEnabled = settings.RegisterBoolSetting(
	settings.ApplicationLevel,
	"tablemetadatacache.automatic_updates.enabled",

might be a bit late for this, but I think we can just call this table metadata instead of table metadata cache. for system related statistics (table size, num ranges, etc) I feel that there's an assumption on precomputation and staleness of the metadata (links for MySQL and Postgres related settings below)

information_schema_stats_expiry
autovacuum\_naptime

pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 31 at r1 (raw file):

// updateJobExecFn specifies the function that is run on each iteration of the
// table metadata update job. It can be overriden in tests.
var updateJobExecFn func(context.Context, isql.Executor) error = updateTableMetadataCache

I recommend making this a property on the tableMetadataUpdateJobResumer struct, so that there's less package globals.

There's (light) precedence for finding a resumer in tests:

If possible, I'd recommend finding the resumer, doing a type assertion and then injecting the mock testing function into the test. The jobs.Registry struct has a function GetResumerForClaimedJob which could help get a direct handle to the resumer in question. Not sure if this comment is totally clear, lmk if it makes sense

pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 36 at r1 (raw file):

// table metadata update job. It is not thread-safe and should only be used in
// tests prior to starting the cluster.
func MockJobExecFn(fn func(context.Context, isql.Executor) error) {

Will this be used outside of tests? If not, I recommend moving it to the test package. There it will neither show up in the package namespace, nor the final cockroach binary

pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 88 at r1 (raw file):
this call to timer.Stop can be safely removed

Note that unlike the standard library's Timer type, this Timer will
not begin counting down until Reset is called for the first time

pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 92 at r1 (raw file):

	for {
		if tableMetadataCacheAutoUpdatesEnabled.Get(&execCtx.ExecCfg().Settings.SV) {
			timer.Reset(tableMetadataCacheValidDuration.Get(&execCtx.ExecCfg().Settings.SV))

One question is what happens if the user enables the automatic updates and sets the duration to zero? If it causes the server to spin, it might be worth guarding against that use case.

xinhaoz

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @angles-n-daemons, @dhartunian, and @kyle-a-wong)

pkg/sql/tablemetadatacache/cluster_settings.go line 23 at r1 (raw file):

Previously, angles-n-daemons (Brian Dillmann) wrote…

might be a bit late for this, but I think we can just call this table metadata instead of table metadata cache. for system related statistics (table size, num ranges, etc) I feel that there's an assumption on precomputation and staleness of the metadata (links for MySQL and Postgres related settings below)

information_schema_stats_expiry
autovacuum\_naptime

Renamed.

pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 31 at r1 (raw file):

Previously, angles-n-daemons (Brian Dillmann) wrote…

I recommend making this a property on the tableMetadataUpdateJobResumer struct, so that there's less package globals.

There's (light) precedence for finding a resumer in tests:

roachtest

sql package test

If possible, I'd recommend finding the resumer, doing a type assertion and then injecting the mock testing function into the test. The jobs.Registry struct has a function GetResumerForClaimedJob which could help get a direct handle to the resumer in question. Not sure if this comment is totally clear, lmk if it makes sense

I'm a little confused about the examples in the links - aren't they just looking for the job id in the jobs table to see if it's running? For this test I don't think we want to mock the entire resumer, just the update logic portion (i.e. what occurs in every loop). There's an alternative way to inject this with testing knobs but it seemed a little overkill to introduce another set of testing knobs to the system, which are also evaluated during runtime, when we could just mock the update logic upfront - what are your thoughts here? I'll dig around a little to see if there are other methods of mocking job executions more granularly. Open to more suggestions here as well.

pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 36 at r1 (raw file):

Previously, angles-n-daemons (Brian Dillmann) wrote…

Will this be used outside of tests? If not, I recommend moving it to the test package. There it will neither show up in the package namespace, nor the final cockroach binary

No it'll only be used for testing. We'll need to export the updateJobExecFn var if we move this fn to the testing pkg which we'd then just able to set directly in the test.

pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 88 at r1 (raw file):

Previously, angles-n-daemons (Brian Dillmann) wrote…

this call to timer.Stop can be safely removed

Note that unlike the standard library's Timer type, this Timer will
not begin counting down until Reset is called for the first time

Nice, done.

pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 92 at r1 (raw file):

Previously, angles-n-daemons (Brian Dillmann) wrote…

One question is what happens if the user enables the automatic updates and sets the duration to zero? If it causes the server to spin, it might be worth guarding against that use case.

Good call - let's enforce a conservative minimum of 1 minute for now. Added as a validator, lmk if you think the min should be increased.

angles-n-daemons

nice looks good to me

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @dhartunian, @kyle-a-wong, and @xinhaoz)

pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 31 at r1 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

I'm a little confused about the examples in the links - aren't they just looking for the job id in the jobs table to see if it's running? For this test I don't think we want to mock the entire resumer, just the update logic portion (i.e. what occurs in every loop). There's an alternative way to inject this with testing knobs but it seemed a little overkill to introduce another set of testing knobs to the system, which are also evaluated during runtime, when we could just mock the update logic upfront - what are your thoughts here? I'll dig around a little to see if there are other methods of mocking job executions more granularly. Open to more suggestions here as well.

Ah, I think where I started was kind of confusing. Yeah I was thinking something along the lines of what you described by the testing knobs. Something to the effect of:

type tableMetadataUpdateJobResumer struct {
    job * jobs.Job
    updateFn func(context.Context, isql.Executor) error
}

You could use the jobId to get a handle to the resumer (so you could patch the updateFn). Thinking about this more though, it seems like the timing would make it tricky. Consider this comment resolved

pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 92 at r1 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

Good call - let's enforce a conservative minimum of 1 minute for now. Added as a validator, lmk if you think the min should be increased.

That sounds perfect to me

dhartunian

Reviewed 1 of 2 files at r2, 3 of 4 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @angles-n-daemons, @kyle-a-wong, and @xinhaoz)

-- commits line 4 at r3:
nit UPDATE

-- commits line 19 at r3:
nit: control.

pkg/sql/tablemetadatacache/cluster_settings.go line 23 at r1 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

Renamed.

Relatedly, can we reconsider these cluster setting names? I think creating a namespace for this stuff would be nice, maybe obs. as a prefix, or even putting them under sql.stats. might be better. Having tablemetadata at the top level feels messy.

pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 88 at r1 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

Nice, done.

I don't see this change in the latest diff.

pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 107 at r1 (raw file):

		// Run table metadata update job.
		log.Infof(ctx, "running table metadata update job")

take it or leave it, but could we log which signal told the job to run?

pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 74 at r3 (raw file):

	// Register callbacks to signal the job to reset the timer when timer related settings change.
	scheduleSettingsCh := make(chan struct{})
	tableMetadataCacheAutoUpdatesEnabled.SetOnChange(&execCtx.ExecCfg().Settings.SV, func(_ context.Context) {

is it possible for Resume to be run twice on a node even though this is a "forever" job? that would register the callbacks multiple times. I guess that might be fine since scheduleSettingsCh would be different.

xinhaoz

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @angles-n-daemons, @dhartunian, and @kyle-a-wong)

-- commits line 4 at r3:

Previously, dhartunian (David Hartunian) wrote…

nit UPDATE

Done.

-- commits line 19 at r3:

Previously, dhartunian (David Hartunian) wrote…

nit: control.

Done.

pkg/sql/tablemetadatacache/cluster_settings.go line 23 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

Relatedly, can we reconsider these cluster setting names? I think creating a namespace for this stuff would be nice, maybe obs. as a prefix, or even putting them under sql.stats. might be better. Having tablemetadata at the top level feels messy.

SGTM, added obs prefix for now!

pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 31 at r1 (raw file):

Previously, angles-n-daemons (Brian Dillmann) wrote…

Ah, I think where I started was kind of confusing. Yeah I was thinking something along the lines of what you described by the testing knobs. Something to the effect of:
type tableMetadataUpdateJobResumer struct {
    job * jobs.Job
    updateFn func(context.Context, isql.Executor) error
}
You could use the jobId to get a handle to the resumer (so you could patch the updateFn). Thinking about this more though, it seems like the timing would make it tricky. Consider this comment resolved

Ahh gotcha. I'll leave this for now and we can introduce it as a testing knob later if it's more well-suited.

pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 88 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

I don't see this change in the latest diff.

Done for real.

pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 107 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

take it or leave it, but could we log which signal told the job to run?

Done.

pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 74 at r3 (raw file):

Previously, dhartunian (David Hartunian) wrote…

is it possible for Resume to be run twice on a node even though this is a "forever" job? that would register the callbacks multiple times. I guess that might be fine since scheduleSettingsCh would be different.

Yeah I was thinking about this as well. I'm not sure under what circumstances, or if it's even possible that, a job may be abandoned by a node and then picked up again by the same node. In the end I thought it was alright since as you mentioned the scheduleSettinsCh would be different. I also found that this was being done in the span config reconciliation forever job -

cockroach/pkg/spanconfig/spanconfigjob/job.go

Line 130 in e9a6d3d

    
           reconciliationJobCheckpointInterval.SetOnChange(settingValues, func(ctx context.Context) {

.

kyle-a-wong

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @angles-n-daemons, @dhartunian, and @xinhaoz)

pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 36 at r1 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

No it'll only be used for testing. We'll need to export the updateJobExecFn var if we move this fn to the testing pkg which we'd then just able to set directly in the test.

I think instead of us defining this custom mocking function, we should use testutils.TestingHook. updateJobExecFn would have to be updated to be public, but then you can do the following in your test below:

defer testutils.TestingHook(&tablemetadatacache.UpdateJobExecFn,
  func(ctx context.Context, ie isql.Executor) error {
     // mock implementation
  })

kyle-a-wong

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @angles-n-daemons, @dhartunian, and @xinhaoz)

pkg/sql/tablemetadatacache/update_table_metadata_cache_job_test.go line 177 at r5 (raw file):

		conn.Exec(t, `SET CLUSTER SETTING obs.tablemetadata.automatic_updates.enabled = f`)
		initialCount := getMockCallCount()
		err := waitForJobRuns(1, 200*time.Millisecond)

Do you think you should also set the TableMetadataCacheValidDuration value to be relatively low to ensure that there are no automatic updates. If a bug is introduced where automatic_upates isn't respected, but the cache valid duration is, It could be the case that the default time (20m) hasn't elapsed but it is still doing automatic updates.

Code quote:

		conn.Exec(t, `SET CLUSTER SETTING obs.tablemetadata.automatic_updates.enabled = f`)
		initialCount := getMockCallCount()
		err := waitForJobRuns(1, 200*time.Millisecond)

pkg/sql/tablemetadatacache/update_table_metadata_cache_job_test.go line 187 at r5 (raw file):

		for i := 1; i < len(mockCalls); i++ {
			timeBetweenCalls := mockCalls[i].Sub(mockCalls[i-1])
			require.GreaterOrEqual(t, timeBetweenCalls, 50*time.Millisecond,

I'm not sure if it makes sense for these assertions to be in a separate t.Run(). It isn't inherently obvious that if this subtest fails, it's actually failing due to code running in "AutomaticUpdatesEnabled"

xinhaoz

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @angles-n-daemons and @dhartunian)

pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 36 at r1 (raw file):

Previously, kyle-a-wong (Kyle Wong) wrote…

I think instead of us defining this custom mocking function, we should use testutils.TestingHook. updateJobExecFn would have to be updated to be public, but then you can do the following in your test below:
defer testutils.TestingHook(&tablemetadatacache.UpdateJobExecFn,
  func(ctx context.Context, ie isql.Executor) error {
     // mock implementation
  })

Ooh, nice find - done. Let's just get rid of this separate testing package for now I think it was not necessary to split and removing it gets rid of some unnecessary pkg exports.

This commit adds the ability to automate running the `UPDATE TABLE METADATA` job. Two new cluster settings are introduced to control the interval the job is run at: - `obs.tablemetadatacache.data_valid_duration` non-negative duration setting, which specifies the duration for which the data in system.table_metadata is considered valid. This duration will be used as a threshold to inform the system on when we should run the job to refresh data. Default 20m. - `obs.tablemetadatacache.automatic_updates.enabled` boolean setting, which defaults to false. If true, the system will execute the update job on according to the interval setting above. Fixes: cockroachdb#130098 Epic: CRDB-37558 Release note (ops change): New cluster settings have been added which control the refresh behaviour for the cached data shown in the db pages in db console. - `obs.tablemetadatacache.data_valid_duration` non-negative duration setting, which specifies the duration for which the data in system.table_metadata is considered valid. If the cache's last update time exceeds this threshold, users visiting the db pages will trigger a cache refresh. Default 20m. - `obs.tablemetadatacache.automatic_updates.enabled` enables the cache to be automatically updated according the validity interval. Default false.

xinhaoz

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @angles-n-daemons, @dhartunian, and @kyle-a-wong)

pkg/sql/tablemetadatacache/update_table_metadata_cache_job_test.go line 177 at r5 (raw file):

Previously, kyle-a-wong (Kyle Wong) wrote…

Do you think you should also set the TableMetadataCacheValidDuration value to be relatively low to ensure that there are no automatic updates. If a bug is introduced where automatic_upates isn't respected, but the cache valid duration is, It could be the case that the default time (20m) hasn't elapsed but it is still doing automatic updates.

The validity duration should still have the previously set short value of 50ms. Copied it over to this subtest too in case code changes over there.

pkg/sql/tablemetadatacache/update_table_metadata_cache_job_test.go line 187 at r5 (raw file):

Previously, kyle-a-wong (Kyle Wong) wrote…

I'm not sure if it makes sense for these assertions to be in a separate t.Run(). It isn't inherently obvious that if this subtest fails, it's actually failing due to code running in "AutomaticUpdatesEnabled"

Merged.

kyle-a-wong

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @angles-n-daemons, @dhartunian, and @xinhaoz)

pkg/sql/tablemetadatacache/update_table_metadata_cache_job_test.go line 177 at r5 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

The validity duration should still have the previously set short value of 50ms. Copied it over to this subtest too in case code changes over there.

Cool thanks! One thing i forgot to mention was that if you run this subtest individually, it wouldn't have the previous subtest's settings, resulting in the situation i mentioned above.

xinhaoz · 2024-09-13T17:01:31Z

TFTR!
bors r+

craig · 2024-09-13T17:36:52Z

Build succeeded:

xinhaoz force-pushed the table-meta-job-settings branch 2 times, most recently from deef37e to 6b47378 Compare September 6, 2024 18:27

xinhaoz requested review from a team, kyle-a-wong, angles-n-daemons and dhartunian and removed request for a team September 6, 2024 18:30

xinhaoz marked this pull request as ready for review September 6, 2024 18:33

angles-n-daemons requested changes Sep 6, 2024

View reviewed changes

xinhaoz force-pushed the table-meta-job-settings branch from 6b47378 to 77d4c69 Compare September 6, 2024 20:42

xinhaoz commented Sep 6, 2024

View reviewed changes

xinhaoz force-pushed the table-meta-job-settings branch 2 times, most recently from cdeec75 to fe20718 Compare September 6, 2024 21:01

xinhaoz requested a review from angles-n-daemons September 9, 2024 14:07

angles-n-daemons approved these changes Sep 9, 2024

View reviewed changes

dhartunian requested a review from angles-n-daemons September 9, 2024 17:27

dhartunian reviewed Sep 9, 2024

View reviewed changes

xinhaoz force-pushed the table-meta-job-settings branch from fe20718 to 56b580b Compare September 11, 2024 15:14

xinhaoz commented Sep 11, 2024

View reviewed changes

xinhaoz force-pushed the table-meta-job-settings branch 10 times, most recently from ff90052 to 73324a7 Compare September 11, 2024 22:07

xinhaoz force-pushed the table-meta-job-settings branch 5 times, most recently from 9a32b93 to 8fc9750 Compare September 12, 2024 15:41

xinhaoz requested a review from dhartunian September 12, 2024 22:08

xinhaoz force-pushed the table-meta-job-settings branch 4 times, most recently from 076774e to e53db62 Compare September 13, 2024 15:01

kyle-a-wong reviewed Sep 13, 2024

View reviewed changes

kyle-a-wong requested changes Sep 13, 2024

View reviewed changes

xinhaoz force-pushed the table-meta-job-settings branch from e53db62 to c79de01 Compare September 13, 2024 16:03

xinhaoz commented Sep 13, 2024

View reviewed changes

xinhaoz requested a review from kyle-a-wong September 13, 2024 16:03

xinhaoz force-pushed the table-meta-job-settings branch from c79de01 to 254e150 Compare September 13, 2024 16:07

xinhaoz commented Sep 13, 2024

View reviewed changes

kyle-a-wong approved these changes Sep 13, 2024

View reviewed changes

craig bot merged commit 259cde8 into cockroachdb:master Sep 13, 2024
23 checks passed

xinhaoz deleted the table-meta-job-settings branch September 20, 2024 15:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tablemetadatacache: add cluster settings to automate job execution #130198

tablemetadatacache: add cluster settings to automate job execution #130198

xinhaoz commented Sep 5, 2024 •

edited

Loading

cockroach-teamcity commented Sep 5, 2024

angles-n-daemons left a comment

xinhaoz left a comment

angles-n-daemons left a comment

dhartunian left a comment

xinhaoz left a comment

kyle-a-wong left a comment

kyle-a-wong left a comment

xinhaoz left a comment

xinhaoz left a comment

kyle-a-wong left a comment

xinhaoz commented Sep 13, 2024

craig bot commented Sep 13, 2024

tablemetadatacache: add cluster settings to automate job execution #130198

tablemetadatacache: add cluster settings to automate job execution #130198

Conversation

xinhaoz commented Sep 5, 2024 • edited Loading

cockroach-teamcity commented Sep 5, 2024

angles-n-daemons left a comment

Choose a reason for hiding this comment

xinhaoz left a comment

Choose a reason for hiding this comment

angles-n-daemons left a comment

Choose a reason for hiding this comment

dhartunian left a comment

Choose a reason for hiding this comment

xinhaoz left a comment

Choose a reason for hiding this comment

kyle-a-wong left a comment

Choose a reason for hiding this comment

kyle-a-wong left a comment

Choose a reason for hiding this comment

xinhaoz left a comment

Choose a reason for hiding this comment

xinhaoz left a comment

Choose a reason for hiding this comment

kyle-a-wong left a comment

Choose a reason for hiding this comment

xinhaoz commented Sep 13, 2024

craig bot commented Sep 13, 2024

xinhaoz commented Sep 5, 2024 •

edited

Loading