jobs: add job to populate system activity tables #100807

Merged 1 commit on Apr 20, 2023

Conversation

@j82w (Contributor) commented Apr 6, 2023

The system statistics tables grow too large for the
UI to quickly return results. The new activity
tables job does the aggregation and only records
the top 500 rows for each of the 6 columns. This means
for a given hour there is a limit of 3,000 rows.
This allows the UI to return results quickly and reliably.

If the job detects there are fewer than 3k
rows, it simply copies all the rows to
the activity tables.

Epic: none
closes: #98882

Release note (sql change): Adds a new sql activity
updater job. The job updates the system
transaction_activity and statement_activity
tables based on the statistics tables.

@j82w j82w requested review from a team April 6, 2023 13:38
@j82w j82w requested a review from a team as a code owner April 6, 2023 13:38
@cockroach-teamcity (Member) commented:

This change is Reviewable

@THardy98 left a comment

Looked over it quickly, just left a few nits. Otherwise LGTM

Reviewed 7 of 8 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w)


-- commits line 4 at r1:
nit: to -> too


pkg/sql/sql_activity_update_job.go line 52 at r1 (raw file):

	"sql.stats.activity.aggregation.interval",
	"the interval at which we aggregate SQL activity statistics upon flush, "+
		"this value must be greater than or equal to sql.stats.activity.flush.interval",

can we enforce this in the cluster setting with another validation function?


pkg/sql/sql_activity_update_job.go line 62 at r1 (raw file):

	settings.TenantWritable,
	"sql.stats.activity.top.limit",
	"the limit for the top number of statistics to be flushed to the, "+

nit: this is per aggregation column correct? the description sounds like this is the total number of rows being flushed to the tables


pkg/sql/sql_activity_update_job.go line 138 at r1 (raw file):

}

// NewSqlActivityUpdater returns a new instance of StatsCompactor.

nit: StatsCompactor -> SqlActivityUpdater


pkg/sql/sql_activity_update_job.go line 310 at r1 (raw file):

	totalTxnClusterExecCount int64,
) (retErr error) {
	// Select the top 500 for execution_count, service_latency,

nit: instead of top 500, top N

@miretskiy (Contributor) left a comment

My review is specific to the jobs system. I do want somebody on the SQL team to take a look at the queries though.
They are way too complex in my opinion. But that opinion is not blocking; the jobs system/upgrades stuff is though.

Reviewed 7 of 8 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w)


pkg/sql/sql_activity_update_job.go line 31 at r1 (raw file):

// Enabled enables the stats activity flush job.
var Enabled = settings.RegisterBoolSetting(

I'm not sure how useful this setting is considering a) it's true by default, and b) if you want
to disable this job, you can just pause the job?


pkg/sql/sql_activity_update_job.go line 38 at r1 (raw file):

// FlushInterval defines the interval to flush the stats
// to the activity tables.
var FlushInterval = settings.RegisterDurationSetting(

is it necessary to export these settings (ditto below)?


pkg/sql/sql_activity_update_job.go line 101 at r1 (raw file):

		sampleInterval := FlushInterval.Get(&settings.SV)
		select {
		case <-time.After(sampleInterval):

perhaps create timer object instead of using time.After?


pkg/sql/sql_activity_update_job.go line 105 at r1 (raw file):

				updater := NewSqlActivityUpdater(settings, execCtx.ExecCfg().InternalDB)
				if err := updater.TransferStatsToActivity(ctx); err != nil {
					return err

you almost certainly don't want to return an error here, do you? Now, I think because the job
is non-cancellable, the jobs system will not cancel this job after an error; but if you do
return errors, the job system will start to add backoff intervals -- as high as 24 hours on each failure.

What you want instead is to (a) make sure that your flush interval is not too small (it isn't), and (b)
make sure you record stats related to these errors, and of course, log them.

The easiest way to record such stats is to take a look at jobs.WithJobMetrics option (as in
https://github.com/cockroachdb/cockroach/blob/334aeea09fe4fb6a04c42125bebe9a29c4c93459/pkg/jobs/metricspoller/poller.go#L113-L113
)

The same metrics mechanism should also be used when constructing updater above so that the
updater can have useful metrics around its operations.
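
A minimal sketch of the log-and-count pattern described here, with a plain counter standing in for the real jobs metrics (the actual wiring would go through the jobs.WithJobMetrics option linked above; all names are illustrative):

package main

import (
	"context"
	"log"
	"sync/atomic"
)

// runUpdateOnce counts and logs a failure instead of returning it to
// the resumer, so the jobs system never applies its failure backoff
// (which can grow as high as 24 hours). numErrors stands in for a
// real job metric; transfer stands in for TransferStatsToActivity.
func runUpdateOnce(ctx context.Context, numErrors *atomic.Int64, transfer func(context.Context) error) {
	if err := transfer(ctx); err != nil {
		numErrors.Add(1)
		log.Printf("sql activity update failed; will retry on next interval: %v", err)
	}
}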


pkg/sql/sql_activity_update_job.go line 123 at r1 (raw file):

	if jobs.HasErrJobCanceled(jobErr) {
		err := errors.NewAssertionErrorWithWrappedErrf(jobErr,
			"sql activity is not cancelable")

not sure you need this because the job is non-cancellable.
But I guess that's the place where you "eat" the error because you return nil.
So, ignore the note about returning an error from resumer, but still, you probably shouldn't.


pkg/sql/sql_activity_update_job.go line 382 at r1 (raw file):

           GROUP BY ts.app_name,
                    ts.fingerprint_id,
                    ts.agg_interval));

my comments so far have been about the jobs system itself. I hope the sql team's "sql magic" is
strong, because, in my opinion, queries like the one above are... probably very difficult to
get right, and more importantly to update in the future. But, that's just my opinion, and
I think the other reviewers on the sql side should opine about this.
(and, no, I'm not saying the query is wrong.. it's just very complex)


pkg/sql/sql_activity_update_job.go line 393 at r1 (raw file):

	}

	// Select the top 500 for execution_count, service_latency,

isn't there a setting for the 500 bit?


pkg/upgrade/upgrades/system_statistics_activity.go line 44 at r1 (raw file):

	}

	if d.TestingKnobs != nil && d.TestingKnobs.SkipUpdateSQLActivityJobBootstrap {

I'm pretty sure modifying existing migration functions is a no-go; please add a new migration.


pkg/upgrade/upgrades/system_statistics_activity.go line 58 at r1 (raw file):

	// Make sure job with id doesn't already exist in system.jobs.
	row, err := d.DB.Executor().QueryRowEx(

I invoke the rule of 3: jobs.AutoConfigRunnerJobID, and jobs.JobMetricsPollerJobID, plus this one need to check for existence before they execute; please pull this into a separate helper file/function
and use that instead (createStaticJobIfNotExist or something like that).
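
A sketch of the helper shape being requested; the PR later adds Registry.CreateIfNotExistAdoptableJobWithTxn for this, so the types and method names below are simplified stand-ins, not the real registry API:

package main

import "context"

// jobTxn is a stand-in for a transaction handle that can both check
// for and create a static job.
type jobTxn interface {
	JobExists(ctx context.Context, jobID int64) (bool, error) // e.g. SELECT id FROM system.jobs WHERE id = $1
	CreateJob(ctx context.Context, jobID int64) error
}

// createStaticJobIfNotExists runs the existence check and the creation
// inside the same transaction, keeping the migration idempotent.
func createStaticJobIfNotExists(ctx context.Context, txn jobTxn, jobID int64) error {
	exists, err := txn.JobExists(ctx, jobID)
	if err != nil || exists {
		return err
	}
	return txn.CreateJob(ctx, jobID)
}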


pkg/upgrade/upgrades/system_statistics_activity.go line 72 at r1 (raw file):

	// If there isn't a row for the update SQL Activity job, create the job.
	if row == nil {
		if _, err = d.JobRegistry.CreateAdoptableJobWithTxn(ctx, record, record.JobID, nil); err != nil {

using nil transaction is not great since it's not done atomically with the existence check above.


pkg/sql/sql_activity_update_job_test.go line 238 at r1 (raw file):

		var txnAggTs time.Time
		row = db.QueryRowContext(ctx, "SELECT count_rows(), aggregated_ts "+
			"FROM system.public.transaction_statistics AS OF SYSTEM TIME follower_read_timestamp() where app_name like 'TestSqlActivityUpdateJobLoop%' group by aggregated_ts")

nit: perhaps just use backticks and wrap these statements a bit more? Also nit ... either capitalize SQL keywords (or don't), but try to be consistent (WHERE, LIKE, GROUP BY, etc)


pkg/sql/sql_activity_update_job_test.go line 329 at r1 (raw file):

		return nil
	}, 1*time.Minute)

how long does this test actually take to run? is it because you need to execute lots of statements to trigger a
compaction? If that's the case, perhaps provide testing knob to lower the threshold?


for {
	sampleInterval := FlushInterval.Get(&settings.SV)
	select {

@dt (Member) left a comment:

You probably want to call MarkIdle before going to sleep here, otherwise you'll prevent serverless pod scale-down while you're sleeping. Arguable whether you want to set it back to false while you're inside the active lines below -- it probably doesn't have any effect (the metrics collection period is such that in all likelihood nobody would ever notice that you went to idle=false for the maybe second that you'll spend running queries?)
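
A sketch of the loop shape this suggests. MarkIdle is the method named in the comment; the job parameter, the work callback, and the rest of the loop are illustrative stand-ins:

package main

import (
	"context"
	"time"
)

// markIdleLoop sleeps with the job marked idle so serverless pods can
// scale down, flipping it back only for the short active window.
func markIdleLoop(ctx context.Context, job interface{ MarkIdle(bool) }, interval time.Duration, work func(context.Context) error) error {
	timer := time.NewTimer(interval)
	defer timer.Stop()
	for {
		job.MarkIdle(true) // idle while sleeping
		select {
		case <-timer.C:
			job.MarkIdle(false) // likely unnoticed for a ~1s active window
			if err := work(ctx); err != nil {
				return err
			}
			timer.Reset(interval)
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}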

@dt (Member) left a comment:
In the case of keyviz, sounds like a bug.

In the case of stream producer: that job is only ever run by system tenants so it doesn't really matter if it indicates idleness or not since you don't scale down system tenant nodes on idleness.

@miretskiy (Contributor) commented, quoting the PR description:
The system statistics tables grow too large for the UI to quickly return results. The new activity tables job does the aggregation and only records the top 500 rows for each of the 6 columns. This means for a given hour there is a limit of 3,000 rows. This allows the UI to return results quickly and reliably.

If the job detects there are fewer than 3k rows, it simply copies all the rows to the activity tables.

Epic: none closes: #98882

Release note: none

Are we sure we don't want any release notes here?

@j82w (Contributor, Author) commented Apr 6, 2023

pkg/sql/sql_activity_update_job.go line 38 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

is it necessary to export these settings (ditto below)?

Yes, larger customers will likely want to adjust these settings.

@j82w (Contributor, Author) commented Apr 6, 2023

pkg/sql/sql_activity_update_job.go line 100 at r1 (raw file):

Previously, dt (David Taylor) wrote…

You probably want to call MarkIdle before going to sleep here, otherwise you'll prevent serverless pod scale-down while you're sleeping. Arguable whether you want to set it back to false while you're inside the active lines below -- it probably doesn't have any effect (the metrics collection period is such that in all likelihood nobody would ever notice that you went to idle=false for the maybe second that you'll spend running queries?)

If this is an issue then why don't I see MarkIdle being called on other jobs?
https://github.com/cockroachdb/cockroach/blob/master/pkg/keyvisualizer/keyvisjob/job.go

@j82w (Contributor, Author) commented Apr 6, 2023

pkg/sql/sql_activity_update_job.go line 101 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

perhaps create timer object instead of using time.After?

What is the benefit of using timer instead of time.After?

@j82w (Contributor, Author) commented Apr 6, 2023

pkg/sql/sql_activity_update_job.go line 100 at r1 (raw file):

Previously, dt (David Taylor) wrote…

In the case of keyviz, sounds like a bug.

In the case of stream producer: that job is only ever run by system tenants so it doesn't really matter if it indicates idleness or not since you don't scale down system tenant nodes on idleness.

Can you file an issue if you think it's a bug? I'm surprised it wasn't caught by any tests. fyi: @zachlite

@j82w (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @THardy98, and @zachlite)


pkg/sql/sql_activity_update_job.go line 52 at r1 (raw file):

Previously, THardy98 (Thomas Hardy) wrote…

can we enforce this in the cluster setting with another validation function?

Unfortunately no, the validation function only passes in the time.Duration value of the current setting. There is no way to access the cluster setting object needed to get the value from the other setting.
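
The workaround the thread converges on is to enforce the relationship where the values are read, since a registration-time validator only sees its own value. A sketch under that assumption (both parameters stand in for the two cluster settings; the clamping behavior and names are illustrative):

package main

import (
	"log"
	"time"
)

// effectiveAggInterval clamps the activity aggregation interval to be
// at least the stats aggregation interval, warning on a mismatch.
func effectiveAggInterval(activityInterval, statsInterval time.Duration) time.Duration {
	if activityInterval < statsInterval {
		log.Printf("sql.stats.activity.aggregation.interval (%s) must be >= sql.stats.aggregation.interval (%s); using %s",
			activityInterval, statsInterval, statsInterval)
		return statsInterval
	}
	return activityInterval
}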


pkg/sql/sql_activity_update_job.go line 393 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

isn't there a setting for the 500 bit?

I forgot to update the comment after making it configurable.


pkg/upgrade/upgrades/system_statistics_activity.go line 44 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

I'm pretty sure modifying existing migration functions is a no-go; please add a new migration.

It's a permanent migration, so I was told it should be ok. We need to backport this to the 23.1 release, so I would like to avoid adding another migration.


pkg/upgrade/upgrades/system_statistics_activity.go line 58 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

I invoke the rule of 3: jobs.AutoConfigRunnerJobID, and jobs.JobMetricsPollerJobID, plus this one need to check for existence before they execute; please pull this into a separate helper file/function
and use that instead (createStaticJobIfNotExist or something like that).

@miretskiy can you explain why I need to check those three? I don't see that check being done here:

// Make sure job with id doesn't already exist in system.jobs.

@j82w (Contributor, Author) commented Apr 6, 2023

pkg/sql/sql_activity_update_job_test.go line 329 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

how long does this test actually take to run? is it because you need to execute lots of statements to trigger a
compaction? If that's the case, perhaps provide testing knob to lower the threshold?

It's usually around 30 seconds, but it seems like most of the time is waiting for the job to start. Do you know of a way to speed up the job logic?

@miretskiy miretskiy requested review from dt, THardy98 and zachlite April 7, 2023 19:16
@miretskiy (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @j82w, @THardy98, and @zachlite)


pkg/sql/sql_activity_update_job.go line 38 at r1 (raw file):

Previously, j82w (Jake) wrote…

Yes, larger customers will likely want to adjust these settings.

Sure; but the variable itself -- doesn't need to be exported outside of sql package; does it?


pkg/sql/sql_activity_update_job.go line 52 at r1 (raw file):

Previously, j82w (Jake) wrote…

Unfortunately no, the validation function only passes in the time.Duration value of the current setting. There is no way to access the cluster setting object needed to get the value from the other setting.

why can't this check happen on the timer in your main function?


pkg/upgrade/upgrades/system_statistics_activity.go line 44 at r1 (raw file):

Previously, j82w (Jake) wrote…

It's a permanent migration, so I was told it should be ok. We need to backport this to the 23.1 release, so I would like to avoid adding another migration.

@dt do you know if this is safe to do? change permanent migration?


pkg/upgrade/upgrades/system_statistics_activity.go line 58 at r1 (raw file):

Previously, j82w (Jake) wrote…

@miretskiy can you explain why I need to check those three? I don't see that check being done here:

// Make sure job with id doesn't already exist in system.jobs.

But it is done in other places I mention, right?
And in general, migrations ought to be idempotent. It'd be nice to just not repeat boilerplate code.
We had 1 job like this; now we are up to 4.. Time to refactor.

@zachlite (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @j82w, @miretskiy, and @THardy98)


pkg/sql/sql_activity_update_job.go line 100 at r1 (raw file):

Previously, j82w (Jake) wrote…

Can you file an issue if you think it's a bug? I'm surprised it wasn't caught by any tests. fyi: @zachlite

The key vis job is only ever run by system tenants too.

@miretskiy (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @j82w, and @THardy98)


pkg/sql/sql_activity_update_job_test.go line 329 at r1 (raw file):

Previously, j82w (Jake) wrote…

It's usually around 30 seconds, but it seems like most of the time is waiting for the job to start. Do you know of a way to speed up the job logic?

yes; you can specify knobs to make job adoption loop run quicker:

func NewTestingKnobsWithShortIntervals() *TestingKnobs {
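
For reference, the way this knob is typically wired into a test server looks roughly like the following (a sketch of the common pattern from memory, assuming the standard cockroach test packages; not this PR's actual test code):

import (
	"context"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/jobs"
	"github.com/cockroachdb/cockroach/pkg/testutils/serverutils"
)

func TestWithShortJobIntervals(t *testing.T) {
	ctx := context.Background()
	// Make the registry's adoption/retry loops run on short intervals
	// so the test does not wait out the default cadence.
	srv, _, _ := serverutils.StartServer(t, base.TestServerArgs{
		Knobs: base.TestingKnobs{
			JobsTestingKnobs: jobs.NewTestingKnobsWithShortIntervals(),
		},
	})
	defer srv.Stopper().Stop(ctx)
	// ... run statements and wait on the job ...
}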

@j82w (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @miretskiy, and @THardy98)


pkg/sql/sql_activity_update_job.go line 38 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Sure; but the variable itself -- doesn't need to be exported outside of sql package; does it?

I misunderstood. I changed them to private.


pkg/sql/sql_activity_update_job.go line 52 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

why can't this check happen on the timer in your main function?

I added a check.


pkg/upgrade/upgrades/system_statistics_activity.go line 58 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

But it is done in other places I mention, right?
And in general, migrations ought to be idempotent. It'd be nice to just not repeat boilerplate code.
We had 1 job like this; now we are up to 4.. Time to refactor.

Fixed in the latest iteration. It's now a new method.


pkg/upgrade/upgrades/system_statistics_activity.go line 72 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

using nil transaction is not great since it's not done atomically with the existence check above.

Done.

@j82w j82w requested review from a team and cucaroach April 10, 2023 17:53
@miretskiy (Contributor) left a comment

Reviewed 2 of 10 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @dt, @j82w, and @THardy98)


pkg/jobs/registry.go line 707 at r3 (raw file):

}

func (r *Registry) CreateIfNotExistAdoptableJobWithTxn(

please add a comment for exported functions.


pkg/sql/sql_activity_update_job.go line 38 at r1 (raw file):

Previously, j82w (Jake) wrote…

I misunderstood. I changed them to private.

ack. thanks.


pkg/sql/sql_activity_update_job.go line 101 at r1 (raw file):

Previously, j82w (Jake) wrote…

What is the benefit of using timer instead of time.After?

In this case, probably not much of a difference; in general, if you had to select on other channels, the problem w/
time.After is that the underlying timer is not garbage collected until the timer fires -- thus, if you have a hot
loop, it will add up quickly.
Another benefit of timer is that it's easy to test since you can inject timerI.
Because of the above, I simply prefer not to use time.After() because its use always puts the load on the reviewer
to understand whether a particular use is safe or not.
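
A minimal sketch of the preferred shape — one reusable Timer per loop rather than a fresh time.After allocation each iteration that cannot be collected until it fires (loop body and names are illustrative):

package main

import (
	"context"
	"time"
)

// flushLoop reuses a single timer across iterations. Reset is safe
// here because it only runs after the channel has fired and drained.
func flushLoop(ctx context.Context, interval time.Duration, work func(context.Context) error) error {
	timer := time.NewTimer(interval)
	defer timer.Stop()
	for {
		select {
		case <-timer.C:
			if err := work(ctx); err != nil {
				return err
			}
			timer.Reset(interval)
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}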

@j82w j82w requested a review from a team as a code owner April 10, 2023 20:39
@j82w (Contributor, Author) commented Apr 10, 2023

pkg/sql/sql_activity_update_job.go line 101 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

In this case, probably not much of a difference; in general, if you had to select on other channels, the problem w/
time.After is that the underlying timer is not garbage collected until the timer fires -- thus, if you have a hot
loop, it will add up quickly.
Another benefit of timer is that it's easy to test since you can inject timerI.
Because of the above, I simply prefer not to use time.After() because its use always puts the load on the reviewer
to understand whether a particular use is safe or not.

Done.

@j82w (Contributor, Author) commented Apr 10, 2023

pkg/sql/sql_activity_update_job.go line 100 at r1 (raw file):

Previously, zachlite wrote…

The key vis job is only ever run by system tenants too.

Done.

@j82w j82w requested a review from a team April 10, 2023 21:01
@j82w (Contributor, Author) commented Apr 11, 2023

pkg/sql/sql_activity_update_job.go line 105 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

you almost certainly don't want to return an error here, do you? Now, I think because the job
is non-cancellable, the jobs system will not cancel this job after an error; but if you do
return errors, the job system will start to add backoff intervals -- as high as 24 hours on each failure.

What you want instead is to (a) make sure that your flush interval is not too small (it isn't), and (b)
make sure you record stats related to these errors, and of course, log them.

The easiest way to record such stats is to take a look at jobs.WithJobMetrics option (as in
https://github.com/cockroachdb/cockroach/blob/334aeea09fe4fb6a04c42125bebe9a29c4c93459/pkg/jobs/metricspoller/poller.go#L113-L113
)

The same metrics mechanism should also be used when constructing updater above so that the
updater can have useful metrics around its operations.

Done.

@j82w j82w requested a review from a team as a code owner April 11, 2023 20:39
@j82w (Contributor, Author) commented Apr 12, 2023

pkg/sql/sql_activity_update_job.go line 271 at r6 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

is this right? why you need an empty string?

The column is not null. The problem is that only statement_statistics has the query text. Instead of joining on statement_statistics, my plan is to do a join on statement_activity, but that is only populated after the transaction_activity is done, to ensure that all the stmts for a transaction are included. I can work on adding the update method to populate the query text field.

@j82w (Contributor, Author) commented Apr 12, 2023

pkg/sql/sql_activity_update_job.go line 123 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

not sure you need this because the job is non-cancellable.
But I guess that's the place where you "eat" the error because you return nil.
So, ignore the note about returning an error from resumer, but still, you probably shouldn't.

Other jobs seem to have this, so I will keep it just in case there is some edge case.

@j82w (Contributor, Author) commented Apr 12, 2023

pkg/sql/sql_activity_update_job.go line 46 at r6 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

you mention you used the 15min to avoid conflict with the flush, but doesn't this value mean that at the 30min mark you will have both at the same time?
Wondering how to define the best frequency, since 15min can be a lot to wait to see data, especially if they already waited 10min for the flush, so now we're talking up to 25min worst case.

Do you have a suggestion?
There are 3 ways I can see to reduce the latency:

  1. Lower the interval to 10min and just find a way to offset it by 1 min from the stats flush.
  2. Decrease the flush interval to something like 5 minutes, and then reduce the activity updater interval.
  3. Do it on demand. Update the table whenever a user requests it and the last update occurred more than 5 minutes ago. The downside is the initial latency would still be pretty bad.

@maryliag (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @dt, @j82w, @miretskiy, and @THardy98)


pkg/sql/sql_activity_update_job.go line 46 at r6 (raw file):

Previously, j82w (Jake) wrote…

Do you have a suggestion?
There are 3 ways I can see to reduce the latency:

  1. Lower the interval to 10min and just find a way to offset it by 1 min from the stats flush.
  2. Decrease the flush interval to something like 5 minutes, and then reduce the activity updater interval.
  3. Do it on demand. Update the table whenever a user requests it and the last update occurred more than 5 minutes ago. The downside is the initial latency would still be pretty bad.

I think for this PR you can keep this frequency, so we unblock the GA blocker, but then more tests can be done and a frequency selected using actual data as the base.
I'm leaning toward option 1 from your list; I think the process can take more than 5 min, so option 2 could be risky, and option 3 could cause latency, which is what we're trying to avoid.

@maryliag (Contributor) left a comment

Great job on this job 😄
How the job works and the queries all look good to me. Maybe @miretskiy should take another look since you changed how/when the job runs.

Otherwise: :lgtm:

Reviewed 14 of 14 files at r7, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach, @dt, @j82w, @miretskiy, and @THardy98)


pkg/sql/sql_activity_update_job.go line 37 at r7 (raw file):

	settings.SystemOnly,
	"sql.stats.activity.flush.enabled",
	"enable the flush to the system statement and transaction activity tables", true)

curious that this is not giving a lint error that the true value should be on the next line


pkg/sql/sql_activity_update_job.go line 643 at r7 (raw file):

	// Delete all the rows older than on the oldest statement_activity aggregated_ts.
	// This makes sure that the 2 tables are always in sync.

nice!

@rharding6373 (Collaborator) left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach, @dt, @j82w, @maryliag, @miretskiy, and @THardy98)


pkg/jobs/registry.go line 714 at r7 (raw file):

		txn.KV(),
		sessiondata.InternalExecutorOverride{User: username.RootUserName()},
		"SELECT * FROM system.jobs WHERE id = $1",

Since we only want to know if a row exists, we can use SELECT id here. There's an index on id, so we won't need to fetch extra columns.


pkg/sql/sql_activity_update_job.go line 218 at r7 (raw file):

	// Only transfer the top sql.stats.activity.top.max for each of
	// the 6 most popular columns

Why 6? How do we know the 6 most popular columns?


pkg/sql/sql_activity_update_job.go line 237 at r7 (raw file):

		sessiondata.NodeUserSessionDataOverride,
		`
			UPSERT INTO system.public.transaction_activity 

Could you add an execbuilder test with all of these large queries (I count 4 of them)? This will make it easier to review and also for the SQL Queries team to monitor plan changes that may affect performance over time. You could make a new file pkg/sql/opt/exec/execbuilder/testdata/observability containing these queries. These are a type of logic test, so the test formatting may be familiar (or you can look through the other tests in execbuilder/testdata for reference). Once the queries are in you can generate the plans with dev test pkg/sql/opt/exec/execbuilder/tests/local -f=TestExecBuilder_observability --rewrite. Feel free to slack me if you run into trouble with this. Thanks!


pkg/jobs/registry_test.go line 475 at r7 (raw file):

				txn.KV(),
				sessiondata.InternalExecutorOverride{User: username.RootUserName()},
				"SELECT * FROM system.jobs WHERE id = $1",

We can also use SELECT id here.


pkg/sql/sql_activity_update_job_test.go line 238 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

nit: perhaps just use backticks and wrap these statements a bit more? Also nit ... either capitalize SQL keywords (or don't), but try to be consistent (WHERE, LIKE, GROUP BY, etc)

Please capitalize keywords! It makes the SQL more readable. I think that there are still some instances of "like" (vs LIKE) in the queries here.


pkg/sql/sql_activity_update_job_test.go line 178 at r7 (raw file):

	}

	_, err = db.ExecContext(ctx, "SET CLUSTER SETTING sql.stats.activity.top.max = 5")

Would it be worth testing that after changing this setting it does what's expected?


pkg/upgrade/upgrades/system_activity_update_job_test.go line 66 at r7 (raw file):

	)

	row := db.QueryRow("select count(*) from system.public.jobs where id = 103")

nit: SELECT count(*) FROM system.public.jobs WHERE id = 103

}

type activityUpdaterMetrics struct {
	numErrors *metric.Counter

@miretskiy (Contributor) left a comment:

you must export numErrors (capitalize) if you want those to show up as metrics. MetricStruct ignores any unexported metrics.

if jobs.HasErrJobCanceled(jobErr) {
	err := errors.NewAssertionErrorWithWrappedErrf(jobErr,
		"sql activity is not cancelable")
	log.Infof(ctx, "%v", err)

@miretskiy (Contributor) left a comment:

log.Error perhaps? Definitely an error...

@rharding6373 rharding6373 self-requested a review April 18, 2023 20:19
@j82w (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach, @dt, @maryliag, @miretskiy, @rharding6373, and @THardy98)


pkg/sql/sql_activity_update_job.go line 46 at r6 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

I think for this PR you can keep this frequency, so we unblock the GA blocker, but then more tests can be done and select a frequency using actual data as the base
I'm leaning to option1 from your list, I think the process can take more than 5 min, so could be risky, and the 3 could cause latency which is what we're trying to avoid.

I updated it to always execute once the node the job is on does a flush. It simplified the execution logic.


pkg/sql/sql_activity_update_job.go line 67 at r6 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

to the activity (some extra space comma and the)

Done.


pkg/sql/sql_activity_update_job.go line 371 at r6 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

you're missing the total execution time from your list here

Done.


pkg/sql/sql_activity_update_job.go line 454 at r6 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

you're missing the total execution time (that we use for the % of all runtime)

Done.


pkg/sql/sql_activity_update_job.go line 218 at r7 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

Why 6? How do we know the 6 most popular columns?

There are too many columns to do all of them. We chose to only optimize for the most common columns that are sorted on, based on feedback and telemetry.


pkg/jobs/registry_test.go line 475 at r7 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

We can also use SELECT id here.

Done.

@j82w (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach, @dt, @maryliag, @miretskiy, @rharding6373, and @THardy98)


pkg/sql/sql_activity_update_job.go line 129 at r7 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

you must export numErrors (capitalize) if you want those to show up as metrics. MetricStruct ignores any unexported metrics.

Done.


pkg/sql/sql_activity_update_job.go line 154 at r7 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

log.Error perhaps? Definitely an error...

Done.


pkg/sql/sql_activity_update_job.go line 237 at r7 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

Could you add an execbuilder test with all of these large queries (I count 4 of them)? This will make it easier to review and also for the SQL Queries team to monitor plan changes that may affect performance over time. You could make a new file pkg/sql/opt/exec/execbuilder/testdata/observability containing these queries. These are a type of logic test, so the test formatting may be familiar (or you can look through the other tests in execbuilder/testdata for reference). Once the queries are in you can generate the plans with dev test pkg/sql/opt/exec/execbuilder/tests/local -f=TestExecBuilder_observability --rewrite. Feel free to slack me if you run into trouble with this. Thanks!

Done.

@rharding6373 (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach, @dt, @j82w, @maryliag, @miretskiy, and @THardy98)


pkg/sql/sql_activity_update_job.go line 347 at r9 (raw file):

	totalTxnClusterExecCount int64,
) (retErr error) {
	// Select the top 500 (controlled by sql.stats.activity.top.max) for

minor nit: I think changing the wording slightly to "Select the top 500...for each of ..." would add a bit of clarity. After listing all of them, it may also help to add "Up to 3000 rows (sql.stats.activity.top.max * 6) may be added to transaction_activity." to hammer that point home.


pkg/sql/sql_activity_update_job.go line 433 at r9 (raw file):

	// execution_count, total execution time, service_latency, cpu_sql_nanos,
	// contention_time, p99_latency, and any statement that was in the
	// top N transactions and insert into statement_activity table.

Does this mean that any row we upserted in the last query should also be upserted in the following query? If so, I don't fully understand how the following query is achieving that. Also if so, I wonder if we can utilize materialized views to dedupe some work.

Code quote:

	// contention_time, p99_latency, and any statement that was in the
	// top N transactions and insert into statement_activity table.

pkg/sql/opt/exec/execbuilder/testdata/observability line 157 at r9 (raw file):

                                    agg_interval,
                                    max(metadata)                                                AS metadata,
                                    crdb_internal.merge_transaction_stats(array_agg(statistics)) AS statistics,

Should this be merge_statement_stats? Here and in the other statement activity upsert below.


pkg/sql/opt/exec/execbuilder/testdata/observability line 487 at r9 (raw file):

                                                                                      spans: /2023-04-10T16:00:00Z-/2023-04-10T16:00:00.000000001Z

# Upsert statement_activity

s/statement/transaction

@j82w (Contributor, Author) commented Apr 20, 2023

pkg/sql/sql_activity_update_job.go line 433 at r9 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

Does this mean that any row we upserted in the last query should also be upserted in the following query? If so, I don't fully understand how the following query is achieving that. Also if so, I wonder if we can utilize materialized views to dedupe some work.

The issue with a materialized view is that it is part of the schema, which means the logic can only be changed on major versions. If we need to change it to only load the top 10, or if we need to increase it to the top 1000, it won't be possible. It would still require a job to run periodically to update the mv.

There are 2 tables: one for statements and one for transactions. It first inserts the top N for the transactions. A transaction can have multiple statements. On the statement table we want the top 500, and we also want all the statements that were part of the top N transactions. That way, if a user selects a transaction in the UI, all the statement information is available for that transaction.

@j82w (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach, @dt, @maryliag, @miretskiy, @rharding6373, and @THardy98)


pkg/sql/sql_activity_update_job.go line 37 at r7 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

curious that this is not giving a lint error that the true value should be on the next line

Done.


pkg/sql/opt/exec/execbuilder/testdata/observability line 157 at r9 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

Should this be merge_statement_stats? Here and in the other statement activity upsert below.

Done.


pkg/sql/opt/exec/execbuilder/testdata/observability line 487 at r9 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

s/statement/transaction

Done.

@j82w (Contributor, Author) commented Apr 20, 2023

pkg/sql/sql_activity_update_job_test.go line 178 at r7 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

Would it be worth testing that after changing this setting it does what's expected?

This was moved to TestSqlActivityUpdateTopLimitJob to avoid the test being too large.

@xinhaoz (Member) commented Apr 20, 2023

-- commits line 9 at r11:
nit: reliably

@xinhaoz (Member) left a comment

Will this process continue to work if we change the aggregation interval of sql stats?

Reviewed 1 of 8 files at r1, 5 of 13 files at r9.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach, @dt, @j82w, @maryliag, @miretskiy, @rharding6373, and @THardy98)


pkg/sql/sql_activity_update_job.go line 52 at r1 (raw file):

Previously, j82w (Jake) wrote…

I added a check.

This might become out of sync with the aggregation interval cluster setting used for system table writes -- is that okay?


pkg/sql/sql_activity_update_job.go line 63 at r11 (raw file):

// sqlStatsActivityMaxPersistedRows specifies maximum number of rows that will be
// retained in system.statement_statistics and system.transaction_statistics.

nit: should these be the activity tables?


pkg/sql/sql_activity_update_job.go line 595 at r11 (raw file):

// ComputeAggregatedTs returns the aggregation timestamp to assign
// in-memory SQL stats during storage or aggregation.
func (u *SqlActivityUpdater) ComputeAggregatedTs(

nit: does this function need to be exported?


pkg/sql/sql_activity_update_job.go line 605 at r11 (raw file):

	// to or greater than the statistics table.
	if interval < intervalSqlStats {
		log.Warningf(ctx, "sql.stats.activity.aggregation.interval: %s must be greater than or equal to sql.stats.aggregation.interval: %s", interval, intervalSqlStats)

Why do we have the additional sql.stats.activity.aggregation_interval?

@j82w (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach, @dt, @maryliag, @miretskiy, @rharding6373, @THardy98, and @xinhaoz)


pkg/jobs/registry.go line 714 at r7 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

Since we only want to know if a row exists, we can use SELECT id here. There's an index on id, so we won't need to fetch extra columns.

Done.


pkg/sql/sql_activity_update_job.go line 52 at r1 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

This might become out of sync with the aggregation interval cluster setting used for system table writes -- is that okay?

I figured it gives flexibility if we want the activity page to have a bigger aggregation interval than the system tables. I went ahead and removed it. We can always add it back later if necessary.


pkg/sql/sql_activity_update_job.go line 63 at r11 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

nit: should these be the activity tables?

Fixed


pkg/sql/sql_activity_update_job.go line 605 at r11 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

Why do we have the additional sql.stats.activity.aggregation_interval?

I removed it.

@rharding6373 (Collaborator) left a comment

:lgtm: for sql queries

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach, @dt, @j82w, @maryliag, @miretskiy, @THardy98, and @xinhaoz)


pkg/sql/sql_activity_update_job.go line 433 at r9 (raw file):

Previously, j82w (Jake) wrote…

The issue with a materialized view is that it is part of the schema, which means the logic can only be changed on major versions. If we need to change it to only load the top 10, or if we need to increase it to the top 1000, it won't be possible. It would still require a job to run periodically to update the mv.

There are 2 tables: one for statements and one for transactions. It first inserts the top N for the transactions. A transaction can have multiple statements. On the statement table we want the top 500, and we also want all the statements that were part of the top N transactions. That way, if a user selects a transaction in the UI, all the statement information is available for that transaction.

Ok, thanks.


pkg/sql/opt/exec/execbuilder/testdata/observability line 157 at r9 (raw file):

Previously, j82w (Jake) wrote…

Done.

I still see we're using crdb_internal.merge_transaction_stats here, although statistics comes from the statement_statistics. Is this expected? It looks like this might have been fixed in pkg/sql/sql_activity_update_job.go.

The system statistics tables grow too large for the
UI to quickly return results. The new activity
tables job does the aggregation and only records the top
500 rows for each of the 6 columns. This means for a given
hour there is a limit of 3,000 rows. This allows
the UI to return results quickly and reliably.

If the job detects there are fewer than 3k
rows, it simply copies all the rows to
the activity tables.

Epic: none
closes: #98882

Release note (sql change): Adds a new sql activity
updater job. The job updates the system
transaction_activity and statement_activity
tables based on the statistics tables.
@j82w (Contributor, Author) commented Apr 20, 2023

pkg/sql/opt/exec/execbuilder/testdata/observability line 157 at r9 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

I still see we're using crdb_internal.merge_transaction_stats here, although statistics comes from the statement_statistics. Is this expected? It looks like this might have been fixed in pkg/sql/sql_activity_update_job.go.

Fixed, I just missed this one. I double checked the others.

@j82w (Contributor, Author) commented Apr 20, 2023

bors r+

@j82w j82w added backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.1.0 labels Apr 20, 2023
@craig bot (Contributor) commented Apr 20, 2023

Build succeeded:

@craig craig bot merged commit 5ae39d9 into cockroachdb:master Apr 20, 2023
@blathers-crl bot commented Apr 20, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 53afe66 to blathers/backport-release-23.1-100807: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


error creating merge commit from 53afe66 to blathers/backport-release-23.1.0-100807: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.0 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Development

Successfully merging this pull request may close these issues.

sql: optimize system.statement_statistics and system.transaction_statistics for console
9 participants