Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tablemetadatacache: add cluster settings to automate job execution #130198

Merged
merged 1 commit into from
Sep 13, 2024

Conversation

xinhaoz
Copy link
Member

@xinhaoz xinhaoz commented Sep 5, 2024

This commit adds the ability to automate running the UPDATE TABLE METADATA job. Two new cluster settings are introduced to control
the interval the job is run at:

  • obs.tablemetadatacache.data_valid_duration non-negative duration
    setting, which specifies the duration for which the data in
    system.table_metadata is considered valid. This duration will be used as
    a threshold to inform the system on when we should run the job to refresh
    data. Default 20m.
  • obs.tablemetadatacache.automatic_updates.enabled boolean setting,
    which defaults to false. If true, the system will execute the update job
    on according to the interval setting above.

Fixes: #130098
Epic: CRDB-37558

Release note (ops change): New cluster settings have been added
which control the refresh behaviour for the cached data shown
in the db pages in db console.

  • obs.tablemetadatacache.data_valid_duration non-negative duration
    setting, which specifies the duration for which the data in
    system.table_metadata is considered valid. If the cache's last update
    time exceeds this threshold, users visiting the db pages will trigger
    a cache refresh. Default 20m.
  • obs.tablemetadatacache.automatic_updates.enabled enables the cache
    to be automatically updated according the validity interval. Default false.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@xinhaoz xinhaoz force-pushed the table-meta-job-settings branch 2 times, most recently from deef37e to 6b47378 Compare September 6, 2024 18:27
@xinhaoz xinhaoz requested review from a team, kyle-a-wong, angles-n-daemons and dhartunian and removed request for a team September 6, 2024 18:30
@xinhaoz xinhaoz marked this pull request as ready for review September 6, 2024 18:33
Copy link
Contributor

@angles-n-daemons angles-n-daemons left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good! couple of comments and questions

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @kyle-a-wong, and @xinhaoz)


pkg/sql/tablemetadatacache/cluster_settings.go line 23 at r1 (raw file):

var tableMetadataCacheAutoUpdatesEnabled = settings.RegisterBoolSetting(
	settings.ApplicationLevel,
	"tablemetadatacache.automatic_updates.enabled",

might be a bit late for this, but I think we can just call this table metadata instead of table metadata cache. for system related statistics (table size, num ranges, etc) I feel that there's an assumption on precomputation and staleness of the metadata (links for MySQL and Postgres related settings below)

information_schema_stats_expiry
autovacuum\_naptime


pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 31 at r1 (raw file):

// updateJobExecFn specifies the function that is run on each iteration of the
// table metadata update job. It can be overriden in tests.
var updateJobExecFn func(context.Context, isql.Executor) error = updateTableMetadataCache

I recommend making this a property on the tableMetadataUpdateJobResumer struct, so that there's less package globals.

There's (light) precedence for finding a resumer in tests:

If possible, I'd recommend finding the resumer, doing a type assertion and then injecting the mock testing function into the test. The jobs.Registry struct has a function GetResumerForClaimedJob which could help get a direct handle to the resumer in question. Not sure if this comment is totally clear, lmk if it makes sense


pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 36 at r1 (raw file):

// table metadata update job. It is not thread-safe and should only be used in
// tests prior to starting the cluster.
func MockJobExecFn(fn func(context.Context, isql.Executor) error) {

Will this be used outside of tests? If not, I recommend moving it to the test package. There it will neither show up in the package namespace, nor the final cockroach binary


pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 88 at r1 (raw file):
this call to timer.Stop can be safely removed

Note that unlike the standard library's Timer type, this Timer will
not begin counting down until Reset is called for the first time


pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 92 at r1 (raw file):

	for {
		if tableMetadataCacheAutoUpdatesEnabled.Get(&execCtx.ExecCfg().Settings.SV) {
			timer.Reset(tableMetadataCacheValidDuration.Get(&execCtx.ExecCfg().Settings.SV))

One question is what happens if the user enables the automatic updates and sets the duration to zero? If it causes the server to spin, it might be worth guarding against that use case.

@xinhaoz xinhaoz force-pushed the table-meta-job-settings branch from 6b47378 to 77d4c69 Compare September 6, 2024 20:42
Copy link
Member Author

@xinhaoz xinhaoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @angles-n-daemons, @dhartunian, and @kyle-a-wong)


pkg/sql/tablemetadatacache/cluster_settings.go line 23 at r1 (raw file):

Previously, angles-n-daemons (Brian Dillmann) wrote…

might be a bit late for this, but I think we can just call this table metadata instead of table metadata cache. for system related statistics (table size, num ranges, etc) I feel that there's an assumption on precomputation and staleness of the metadata (links for MySQL and Postgres related settings below)

information_schema_stats_expiry
autovacuum\_naptime

Renamed.


pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 31 at r1 (raw file):

Previously, angles-n-daemons (Brian Dillmann) wrote…

I recommend making this a property on the tableMetadataUpdateJobResumer struct, so that there's less package globals.

There's (light) precedence for finding a resumer in tests:

If possible, I'd recommend finding the resumer, doing a type assertion and then injecting the mock testing function into the test. The jobs.Registry struct has a function GetResumerForClaimedJob which could help get a direct handle to the resumer in question. Not sure if this comment is totally clear, lmk if it makes sense

I'm a little confused about the examples in the links - aren't they just looking for the job id in the jobs table to see if it's running? For this test I don't think we want to mock the entire resumer, just the update logic portion (i.e. what occurs in every loop). There's an alternative way to inject this with testing knobs but it seemed a little overkill to introduce another set of testing knobs to the system, which are also evaluated during runtime, when we could just mock the update logic upfront - what are your thoughts here? I'll dig around a little to see if there are other methods of mocking job executions more granularly. Open to more suggestions here as well.


pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 36 at r1 (raw file):

Previously, angles-n-daemons (Brian Dillmann) wrote…

Will this be used outside of tests? If not, I recommend moving it to the test package. There it will neither show up in the package namespace, nor the final cockroach binary

No it'll only be used for testing. We'll need to export the updateJobExecFn var if we move this fn to the testing pkg which we'd then just able to set directly in the test.


pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 88 at r1 (raw file):

Previously, angles-n-daemons (Brian Dillmann) wrote…

this call to timer.Stop can be safely removed

Note that unlike the standard library's Timer type, this Timer will
not begin counting down until Reset is called for the first time

Nice, done.


pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 92 at r1 (raw file):

Previously, angles-n-daemons (Brian Dillmann) wrote…

One question is what happens if the user enables the automatic updates and sets the duration to zero? If it causes the server to spin, it might be worth guarding against that use case.

Good call - let's enforce a conservative minimum of 1 minute for now. Added as a validator, lmk if you think the min should be increased.

@xinhaoz xinhaoz force-pushed the table-meta-job-settings branch 2 times, most recently from cdeec75 to fe20718 Compare September 6, 2024 21:01
Copy link
Contributor

@angles-n-daemons angles-n-daemons left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice looks good to me
:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dhartunian, @kyle-a-wong, and @xinhaoz)


pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 31 at r1 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

I'm a little confused about the examples in the links - aren't they just looking for the job id in the jobs table to see if it's running? For this test I don't think we want to mock the entire resumer, just the update logic portion (i.e. what occurs in every loop). There's an alternative way to inject this with testing knobs but it seemed a little overkill to introduce another set of testing knobs to the system, which are also evaluated during runtime, when we could just mock the update logic upfront - what are your thoughts here? I'll dig around a little to see if there are other methods of mocking job executions more granularly. Open to more suggestions here as well.

Ah, I think where I started was kind of confusing. Yeah I was thinking something along the lines of what you described by the testing knobs. Something to the effect of:

type tableMetadataUpdateJobResumer struct {
    job * jobs.Job
    updateFn func(context.Context, isql.Executor) error
}

You could use the jobId to get a handle to the resumer (so you could patch the updateFn). Thinking about this more though, it seems like the timing would make it tricky. Consider this comment resolved


pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 92 at r1 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

Good call - let's enforce a conservative minimum of 1 minute for now. Added as a validator, lmk if you think the min should be increased.

That sounds perfect to me

Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r2, 3 of 4 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @angles-n-daemons, @kyle-a-wong, and @xinhaoz)


-- commits line 4 at r3:
nit UPDATE


-- commits line 19 at r3:
nit: control.


pkg/sql/tablemetadatacache/cluster_settings.go line 23 at r1 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

Renamed.

Relatedly, can we reconsider these cluster setting names? I think creating a namespace for this stuff would be nice, maybe obs. as a prefix, or even putting them under sql.stats. might be better. Having tablemetadata at the top level feels messy.


pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 88 at r1 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

Nice, done.

I don't see this change in the latest diff.


pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 107 at r1 (raw file):

		// Run table metadata update job.
		log.Infof(ctx, "running table metadata update job")

take it or leave it, but could we log which signal told the job to run?


pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 74 at r3 (raw file):

	// Register callbacks to signal the job to reset the timer when timer related settings change.
	scheduleSettingsCh := make(chan struct{})
	tableMetadataCacheAutoUpdatesEnabled.SetOnChange(&execCtx.ExecCfg().Settings.SV, func(_ context.Context) {

is it possible for Resume to be run twice on a node even though this is a "forever" job? that would register the callbacks multiple times. I guess that might be fine since scheduleSettingsCh would be different.

@xinhaoz xinhaoz force-pushed the table-meta-job-settings branch from fe20718 to 56b580b Compare September 11, 2024 15:14
Copy link
Member Author

@xinhaoz xinhaoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @angles-n-daemons, @dhartunian, and @kyle-a-wong)


-- commits line 4 at r3:

Previously, dhartunian (David Hartunian) wrote…

nit UPDATE

Done.


-- commits line 19 at r3:

Previously, dhartunian (David Hartunian) wrote…

nit: control.

Done.


pkg/sql/tablemetadatacache/cluster_settings.go line 23 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

Relatedly, can we reconsider these cluster setting names? I think creating a namespace for this stuff would be nice, maybe obs. as a prefix, or even putting them under sql.stats. might be better. Having tablemetadata at the top level feels messy.

SGTM, added obs prefix for now!


pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 31 at r1 (raw file):

Previously, angles-n-daemons (Brian Dillmann) wrote…

Ah, I think where I started was kind of confusing. Yeah I was thinking something along the lines of what you described by the testing knobs. Something to the effect of:

type tableMetadataUpdateJobResumer struct {
    job * jobs.Job
    updateFn func(context.Context, isql.Executor) error
}

You could use the jobId to get a handle to the resumer (so you could patch the updateFn). Thinking about this more though, it seems like the timing would make it tricky. Consider this comment resolved

Ahh gotcha. I'll leave this for now and we can introduce it as a testing knob later if it's more well-suited.


pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 88 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

I don't see this change in the latest diff.

Done for real.


pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 107 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

take it or leave it, but could we log which signal told the job to run?

Done.


pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 74 at r3 (raw file):

Previously, dhartunian (David Hartunian) wrote…

is it possible for Resume to be run twice on a node even though this is a "forever" job? that would register the callbacks multiple times. I guess that might be fine since scheduleSettingsCh would be different.

Yeah I was thinking about this as well. I'm not sure under what circumstances, or if it's even possible that, a job may be abandoned by a node and then picked up again by the same node. In the end I thought it was alright since as you mentioned the scheduleSettinsCh would be different. I also found that this was being done in the span config reconciliation forever job -

reconciliationJobCheckpointInterval.SetOnChange(settingValues, func(ctx context.Context) {
.

@xinhaoz xinhaoz force-pushed the table-meta-job-settings branch 10 times, most recently from ff90052 to 73324a7 Compare September 11, 2024 22:07
@xinhaoz xinhaoz force-pushed the table-meta-job-settings branch 5 times, most recently from 9a32b93 to 8fc9750 Compare September 12, 2024 15:41
@xinhaoz xinhaoz requested a review from dhartunian September 12, 2024 22:08
@xinhaoz xinhaoz force-pushed the table-meta-job-settings branch 4 times, most recently from 076774e to e53db62 Compare September 13, 2024 15:01
Copy link
Contributor

@kyle-a-wong kyle-a-wong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @angles-n-daemons, @dhartunian, and @xinhaoz)


pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 36 at r1 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

No it'll only be used for testing. We'll need to export the updateJobExecFn var if we move this fn to the testing pkg which we'd then just able to set directly in the test.

I think instead of us defining this custom mocking function, we should use testutils.TestingHook. updateJobExecFn would have to be updated to be public, but then you can do the following in your test below:

defer testutils.TestingHook(&tablemetadatacache.UpdateJobExecFn,
  func(ctx context.Context, ie isql.Executor) error {
     // mock implementation
  })

Copy link
Contributor

@kyle-a-wong kyle-a-wong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @angles-n-daemons, @dhartunian, and @xinhaoz)


pkg/sql/tablemetadatacache/update_table_metadata_cache_job_test.go line 177 at r5 (raw file):

		conn.Exec(t, `SET CLUSTER SETTING obs.tablemetadata.automatic_updates.enabled = f`)
		initialCount := getMockCallCount()
		err := waitForJobRuns(1, 200*time.Millisecond)

Do you think you should also set the TableMetadataCacheValidDuration value to be relatively low to ensure that there are no automatic updates. If a bug is introduced where automatic_upates isn't respected, but the cache valid duration is, It could be the case that the default time (20m) hasn't elapsed but it is still doing automatic updates.

Code quote:

		conn.Exec(t, `SET CLUSTER SETTING obs.tablemetadata.automatic_updates.enabled = f`)
		initialCount := getMockCallCount()
		err := waitForJobRuns(1, 200*time.Millisecond)

pkg/sql/tablemetadatacache/update_table_metadata_cache_job_test.go line 187 at r5 (raw file):

		for i := 1; i < len(mockCalls); i++ {
			timeBetweenCalls := mockCalls[i].Sub(mockCalls[i-1])
			require.GreaterOrEqual(t, timeBetweenCalls, 50*time.Millisecond,

I'm not sure if it makes sense for these assertions to be in a separate t.Run(). It isn't inherently obvious that if this subtest fails, it's actually failing due to code running in "AutomaticUpdatesEnabled"

@xinhaoz xinhaoz force-pushed the table-meta-job-settings branch from e53db62 to c79de01 Compare September 13, 2024 16:03
Copy link
Member Author

@xinhaoz xinhaoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @angles-n-daemons and @dhartunian)


pkg/sql/tablemetadatacache/update_table_metadata_cache_job.go line 36 at r1 (raw file):

Previously, kyle-a-wong (Kyle Wong) wrote…

I think instead of us defining this custom mocking function, we should use testutils.TestingHook. updateJobExecFn would have to be updated to be public, but then you can do the following in your test below:

defer testutils.TestingHook(&tablemetadatacache.UpdateJobExecFn,
  func(ctx context.Context, ie isql.Executor) error {
     // mock implementation
  })

Ooh, nice find - done. Let's just get rid of this separate testing package for now I think it was not necessary to split and removing it gets rid of some unnecessary pkg exports.

@xinhaoz xinhaoz requested a review from kyle-a-wong September 13, 2024 16:03
This commit adds the ability to automate running the `UPDATE TABLE
METADATA` job. Two new cluster settings are introduced to control
the interval the job is run at:
- `obs.tablemetadatacache.data_valid_duration` non-negative duration
setting, which specifies the duration for which the data in
system.table_metadata is considered valid. This duration will be used as
a threshold to inform the system on when we should run the job to refresh
data. Default 20m.
- `obs.tablemetadatacache.automatic_updates.enabled` boolean setting,
which defaults to false. If true, the system will execute the update job
on according to the interval setting above.

Fixes: cockroachdb#130098
Epic: CRDB-37558

Release note (ops change): New cluster settings have been added
which control the refresh behaviour for the cached data shown
in the db pages in db console.
- `obs.tablemetadatacache.data_valid_duration` non-negative duration
setting, which specifies the duration for which the data in
system.table_metadata is considered valid. If the cache's last update
time  exceeds this threshold, users visiting the db pages will trigger
a cache refresh. Default 20m.
 - `obs.tablemetadatacache.automatic_updates.enabled` enables the cache
to be automatically updated according the validity interval. Default false.
@xinhaoz xinhaoz force-pushed the table-meta-job-settings branch from c79de01 to 254e150 Compare September 13, 2024 16:07
Copy link
Member Author

@xinhaoz xinhaoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @angles-n-daemons, @dhartunian, and @kyle-a-wong)


pkg/sql/tablemetadatacache/update_table_metadata_cache_job_test.go line 177 at r5 (raw file):

Previously, kyle-a-wong (Kyle Wong) wrote…

Do you think you should also set the TableMetadataCacheValidDuration value to be relatively low to ensure that there are no automatic updates. If a bug is introduced where automatic_upates isn't respected, but the cache valid duration is, It could be the case that the default time (20m) hasn't elapsed but it is still doing automatic updates.

The validity duration should still have the previously set short value of 50ms. Copied it over to this subtest too in case code changes over there.


pkg/sql/tablemetadatacache/update_table_metadata_cache_job_test.go line 187 at r5 (raw file):

Previously, kyle-a-wong (Kyle Wong) wrote…

I'm not sure if it makes sense for these assertions to be in a separate t.Run(). It isn't inherently obvious that if this subtest fails, it's actually failing due to code running in "AutomaticUpdatesEnabled"

Merged.

Copy link
Contributor

@kyle-a-wong kyle-a-wong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @angles-n-daemons, @dhartunian, and @xinhaoz)


pkg/sql/tablemetadatacache/update_table_metadata_cache_job_test.go line 177 at r5 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

The validity duration should still have the previously set short value of 50ms. Copied it over to this subtest too in case code changes over there.

Cool thanks! One thing i forgot to mention was that if you run this subtest individually, it wouldn't have the previous subtest's settings, resulting in the situation i mentioned above.

@xinhaoz
Copy link
Member Author

xinhaoz commented Sep 13, 2024

TFTR!
bors r+

@craig craig bot merged commit 259cde8 into cockroachdb:master Sep 13, 2024
23 checks passed
@xinhaoz xinhaoz deleted the table-meta-job-settings branch September 20, 2024 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql/tablemetadatacache: add cluster setting to run table metadata job on an interval
5 participants