jobs: add job to populate system activity tables #100807
Conversation
Looked over it quickly, just left a few nits. Otherwise LGTM
Reviewed 7 of 8 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @j82w)
-- commits
line 4 at r1:
nit: to -> too
pkg/sql/sql_activity_update_job.go
line 52 at r1 (raw file):
"sql.stats.activity.aggregation.interval",
"the interval at which we aggregate SQL activity statistics upon flush, "+
	"this value must be greater than or equal to sql.stats.activity.flush.interval",
can we enforce this in the cluster setting with another validation function?
pkg/sql/sql_activity_update_job.go
line 62 at r1 (raw file):
settings.TenantWritable,
"sql.stats.activity.top.limit",
"the limit for the top number of statistics to be flushed to the, "+
nit: this is per aggregation column correct? the description sounds like this is the total number of rows being flushed to the tables
pkg/sql/sql_activity_update_job.go
line 138 at r1 (raw file):
}

// NewSqlActivityUpdater returns a new instance of StatsCompactor.
nit: StatsCompactor -> SqlActivityUpdater
pkg/sql/sql_activity_update_job.go
line 310 at r1 (raw file):
totalTxnClusterExecCount int64,
) (retErr error) {
	// Select the top 500 for execution_count, service_latency,
nit: instead of top 500, top N
My review is specific to the jobs system. I do want somebody on the SQL team to take a look at the queries, though.
They are way too complex in my opinion. But that opinion is not blocking; the jobs system/upgrades stuff is.
Reviewed 7 of 8 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @j82w)
pkg/sql/sql_activity_update_job.go
line 31 at r1 (raw file):
// Enabled enables the stats activity flush job.
var Enabled = settings.RegisterBoolSetting(
I'm not sure how useful this setting is considering a) it's true by default, and b) if you want
to disable this job, you can just pause the job?
pkg/sql/sql_activity_update_job.go
line 38 at r1 (raw file):
// FlushInterval defines the interval to flush the stats
// to the activity tables.
var FlushInterval = settings.RegisterDurationSetting(
is it necessary to export these settings (ditto below)?
pkg/sql/sql_activity_update_job.go
line 101 at r1 (raw file):
sampleInterval := FlushInterval.Get(&settings.SV)
select {
case <-time.After(sampleInterval):
perhaps create timer object instead of using time.After?
pkg/sql/sql_activity_update_job.go
line 105 at r1 (raw file):
updater := NewSqlActivityUpdater(settings, execCtx.ExecCfg().InternalDB)
if err := updater.TransferStatsToActivity(ctx); err != nil {
	return err
you almost certainly don't want to return an error here, do you? Now, I think because the job
is non-cancellable, the jobs system will not cancel this job after an error; but if you do
return errors, the jobs system will start to add backoff intervals -- as high as 24 hours on each failure.
What you want instead is to a) make sure that your flush interval is not too small (it isn't), and b)
make sure you record stats related to these errors and, of course, log them.
The easiest way to record such stats is to take a look at jobs.WithJobMetrics
option (as in
https://github.com/cockroachdb/cockroach/blob/334aeea09fe4fb6a04c42125bebe9a29c4c93459/pkg/jobs/metricspoller/poller.go#L113-L113
)
The same metrics mechanism should also be used when constructing updater above so that the
updater can have useful metrics around its operations.
pkg/sql/sql_activity_update_job.go
line 123 at r1 (raw file):
if jobs.HasErrJobCanceled(jobErr) {
	err := errors.NewAssertionErrorWithWrappedErrf(jobErr,
		"sql activity is not cancelable")
not sure you need this because the job is non-cancellable.
But I guess that's the place where you "eat" the error, because you return nil.
So, ignore the note about returning an error from the resumer -- but still, you probably shouldn't.
pkg/sql/sql_activity_update_job.go
line 382 at r1 (raw file):
GROUP BY ts.app_name, ts.fingerprint_id, ts.agg_interval));
my comments so far have been about the jobs system itself. I hope the sql team's "sql magic" is
strong, because, in my opinion, queries like the one above are... probably very difficult to
get right, and more importantly to update in the future. But that's just my opinion, and
I think the other reviewers on the sql side should opine about this.
(and, no, I'm not saying the query is wrong.. it's just very complex)
pkg/sql/sql_activity_update_job.go
line 393 at r1 (raw file):
}

// Select the top 500 for execution_count, service_latency,
isn't there a setting for the 500 bit?
pkg/upgrade/upgrades/system_statistics_activity.go
line 44 at r1 (raw file):
}

if d.TestingKnobs != nil && d.TestingKnobs.SkipUpdateSQLActivityJobBootstrap {
I'm pretty sure modifying existing migration functions is a no-go; please add a new migration.
pkg/upgrade/upgrades/system_statistics_activity.go
line 58 at r1 (raw file):
// Make sure job with id doesn't already exist in system.jobs.
row, err := d.DB.Executor().QueryRowEx(
I invoke the rule of 3: jobs.AutoConfigRunnerJobID and jobs.JobMetricsPollerJobID, plus this one, need to check for existence before they execute; please pull this into a separate helper file/function and use that instead (createStaticJobIfNotExist or something like that).
pkg/upgrade/upgrades/system_statistics_activity.go
line 72 at r1 (raw file):
// If there isn't a row for the update SQL Activity job, create the job.
if row == nil {
	if _, err = d.JobRegistry.CreateAdoptableJobWithTxn(ctx, record, record.JobID, nil); err != nil {
using nil transaction is not great since it's not done atomically with the existence check above.
pkg/sql/sql_activity_update_job_test.go
line 238 at r1 (raw file):
var txnAggTs time.Time
row = db.QueryRowContext(ctx, "SELECT count_rows(), aggregated_ts "+
	"FROM system.public.transaction_statistics AS OF SYSTEM TIME follower_read_timestamp() where app_name like 'TestSqlActivityUpdateJobLoop%' group by aggregated_ts")
nit: perhaps just use backticks and wrap these statements a bit more? Also nit ... either capitalize SQL keywords (or don't), but try to be consistent (WHERE, LIKE, GROUP BY, etc)
pkg/sql/sql_activity_update_job_test.go
line 329 at r1 (raw file):
return nil
}, 1*time.Minute)
how long does this test actually take to run? is it because you need to execute lots of statements to trigger a
compaction? If that's the case, perhaps provide testing knob to lower the threshold?
for {
	sampleInterval := FlushInterval.Get(&settings.SV)
	select {
You probably want to call MarkIdle before going to sleep here; otherwise you'll prevent serverless pod scale-down while you're sleeping. It's arguable whether you want to set it back to false while you're inside the active lines below -- it probably doesn't have any effect (the metrics collection period is such that, in all likelihood, nobody would ever notice that you went to idle=false for the maybe-second you'll spend running queries).
In the case of keyviz, sounds like a bug.
In the case of stream producer: that job is only ever run by system tenants so it doesn't really matter if it indicates idleness or not since you don't scale down system tenant nodes on idleness.
Are we sure we don't want any release notes here?
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Yes, larger customers will likely want to adjust these settings.
Previously, dt (David Taylor) wrote…
If this is an issue then why don't I see MarkIdle being called on other jobs?
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
What is the benefit of using timer instead of time.After?
Previously, dt (David Taylor) wrote…
Can you file an issue if you think it's a bug? I'm surprised it wasn't caught by any tests. fyi: @zachlite
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @THardy98, and @zachlite)
pkg/sql/sql_activity_update_job.go
line 52 at r1 (raw file):
Previously, THardy98 (Thomas Hardy) wrote…
can we enforce this in the cluster setting with another validation function?
Unfortunately no, the validation function only passes in the time.Duration value of the current setting. There is no way to access the cluster setting object needed to get the value from the other setting.
pkg/sql/sql_activity_update_job.go
line 393 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
isn't there a setting for the 500 bit?
I forgot to update the comment after making it configurable.
pkg/upgrade/upgrades/system_statistics_activity.go
line 44 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I'm pretty sure modifying existing migration functions is a no-go; please add a new migration.
It's a permanent migration, so I was told it should be ok. We need to backport this to the 23.1 release, so I would like to avoid adding another migration.
pkg/upgrade/upgrades/system_statistics_activity.go
line 58 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I invoke the rule of 3: jobs.AutoConfigRunnerJobID, and jobs.JobMetricsPollerJobID, plus this one need to check for existence before they execute; please pull this into a separate helper file/function
and use that instead (createStaticJobIfNotExist
or something like that).
@miretskiy can you explain why I need to check those three? I don't see that check being done here:
// Make sure job with id doesn't already exist in system.jobs.
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
It's usually around 30 seconds, but it seems like most of the time is waiting for the job to start. Do you know of a way to speed up the job logic?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @j82w, @THardy98, and @zachlite)
pkg/sql/sql_activity_update_job.go
line 38 at r1 (raw file):
Previously, j82w (Jake) wrote…
Yes, larger customers will likely want to adjust these settings.
Sure; but the variable itself -- doesn't need to be exported outside of sql package; does it?
pkg/sql/sql_activity_update_job.go
line 52 at r1 (raw file):
Previously, j82w (Jake) wrote…
Unfortunately no, the validation function only passes in the time.Duration value of the current setting. There is no way to access the cluster setting object needed to get the value from the other setting.
why can't this check happen on the timer in your main function?
pkg/upgrade/upgrades/system_statistics_activity.go
line 44 at r1 (raw file):
Previously, j82w (Jake) wrote…
It's a permanent migration, so I was told it should be ok. We need to backport this to the 23.1 release, so I would like to avoid adding another migration.
@dt do you know if this is safe to do? change permanent migration?
pkg/upgrade/upgrades/system_statistics_activity.go
line 58 at r1 (raw file):
Previously, j82w (Jake) wrote…
@miretskiy can you explain why I need to check those three? I don't see that check being done here:
// Make sure job with id doesn't already exist in system.jobs.
But it is done in other places I mention, right?
And in general, migrations ought to be idempotent. It'd be nice to just not repeat boilerplate code.
We had 1 job like this; now we are up to 4. Time to refactor.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @j82w, @miretskiy, and @THardy98)
pkg/sql/sql_activity_update_job.go
line 100 at r1 (raw file):
Previously, j82w (Jake) wrote…
Can you file an issue if you think it's a bug? I'm surprised it wasn't caught by any tests. fyi: @zachlite
The key vis job is only ever run by system tenants too.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @j82w, and @THardy98)
pkg/sql/sql_activity_update_job_test.go
line 329 at r1 (raw file):
Previously, j82w (Jake) wrote…
It's usually around 30 seconds, but it seems like most of the time is waiting for the job to start. Do you know of a way to speed up the job logic?
yes; you can specify knobs to make the job adoption loop run quicker:
cockroach/pkg/jobs/testing_knobs.go, line 124 at ad7d868:
func NewTestingKnobsWithShortIntervals() *TestingKnobs {
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @miretskiy, and @THardy98)
pkg/sql/sql_activity_update_job.go
line 38 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Sure; but the variable itself -- doesn't need to be exported outside of sql package; does it?
I misunderstood. I changed them to private.
pkg/sql/sql_activity_update_job.go
line 52 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
why can't this check happen on the timer in your main function?
I added a check.
pkg/upgrade/upgrades/system_statistics_activity.go
line 58 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
But it is done in other places I mention, right?
And in general, migrations ought to be idempotent. It'd be nice to just not repeat boilerplate code.
We had 1 job like this; now we are up to 4. Time to refactor.
Fixed in the latest iteration. It's now a new method.
pkg/upgrade/upgrades/system_statistics_activity.go
line 72 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
using nil transaction is not great since it's not done atomically with the existence check above.
Done.
Reviewed 2 of 10 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @dt, @j82w, and @THardy98)
pkg/jobs/registry.go
line 707 at r3 (raw file):
}

func (r *Registry) CreateIfNotExistAdoptableJobWithTxn(
please add a comment for exported functions.
pkg/sql/sql_activity_update_job.go
line 38 at r1 (raw file):
Previously, j82w (Jake) wrote…
I misunderstood. I changed them to private.
ack. thanks.
pkg/sql/sql_activity_update_job.go
line 101 at r1 (raw file):
Previously, j82w (Jake) wrote…
What is the benefit of using timer instead of time.After?
In this case, probably not much of a difference; in general, if you had to select on other channels, the problem with
time.After is that the underlying timer is not garbage collected until it fires -- thus, if you have a hot
loop, it will add up quickly.
Another benefit of a timer is that it's easy to test, since you can inject timerI.
Because of the above, I simply prefer not to use time.After(), because its use always puts the load on the reviewer
to understand whether a particular use is safe or not.
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Done.
Previously, zachlite wrote…
Done.
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Done.
Previously, maryliag (Marylia Gutierrez) wrote…
The column is not null. The problem is that only statement_statistics has the query text. Instead of joining on statement_statistics, my plan is to do a join on statement_activity, but that is only populated after the transaction_activity is done, to ensure that all the stmts for a transaction are included. I can work on adding the update method to populate the query text field.
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Other jobs seem to have this, so I will keep it just in case there is some edge case.
Previously, maryliag (Marylia Gutierrez) wrote…
Do you have a suggestion?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @dt, @j82w, @miretskiy, and @THardy98)
pkg/sql/sql_activity_update_job.go
line 46 at r6 (raw file):
Previously, j82w (Jake) wrote…
Do you have a suggestion?
There are a few ways I can see to reduce the latency:
- Lower the interval to 10min and just find a way to offset it by 1 min from the stats flush.
- Decrease the flush interval to something like 5 minutes, and then reduce the activity updater interval.
- Do it on demand: update the table whenever a user requests it and the last update occurred more than 5 minutes ago. The downside is the initial latency would still be pretty bad.
I think for this PR you can keep this frequency, so we unblock the GA blocker, but then more tests can be done and select a frequency using actual data as the base
I'm leaning to option 1 from your list; I think the process can take more than 5 min, so option 2 could be risky, and option 3 could cause latency, which is what we're trying to avoid.
Great job on this job 😄
How the job works and the queries all look good to me. Maybe @miretskiy should take another look since you changed how/when the job runs.
Reviewed 14 of 14 files at r7, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach, @dt, @j82w, @miretskiy, and @THardy98)
pkg/sql/sql_activity_update_job.go
line 37 at r7 (raw file):
settings.SystemOnly,
"sql.stats.activity.flush.enabled",
"enable the flush to the system statement and transaction activity tables", true)
curious that this is not giving a lint error that the true value should be on the next line
pkg/sql/sql_activity_update_job.go
line 643 at r7 (raw file):
// Delete all the rows older than the oldest statement_activity aggregated_ts.
// This makes sure that the 2 tables are always in sync.
nice!
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach, @dt, @j82w, @maryliag, @miretskiy, and @THardy98)
pkg/jobs/registry.go
line 714 at r7 (raw file):
txn.KV(), sessiondata.InternalExecutorOverride{User: username.RootUserName()},
"SELECT * FROM system.jobs WHERE id = $1",

Since we only want to know if a row exists, we can use SELECT id here. There's an index on id, so we won't need to fetch extra columns.
pkg/sql/sql_activity_update_job.go
line 218 at r7 (raw file):
// Only transfer the top sql.stats.activity.top.max for each of
// the 6 most popular columns
Why 6? How do we know the 6 most popular columns?
pkg/sql/sql_activity_update_job.go
line 237 at r7 (raw file):
sessiondata.NodeUserSessionDataOverride, `
UPSERT INTO system.public.transaction_activity
Could you add an execbuilder test with all of these large queries (I count 4 of them)? This will make it easier to review, and also for the SQL Queries team to monitor plan changes that may affect performance over time. You could make a new file pkg/sql/opt/exec/execbuilder/testdata/observability containing these queries. These are a type of logic test, so the test formatting may be familiar (or you can look through the other tests in execbuilder/testdata for reference). Once the queries are in, you can generate the plans with dev test pkg/sql/opt/exec/execbuilder/tests/local -f=TestExecBuilder_observability --rewrite. Feel free to slack me if you run into trouble with this. Thanks!
pkg/jobs/registry_test.go
line 475 at r7 (raw file):
txn.KV(), sessiondata.InternalExecutorOverride{User: username.RootUserName()},
"SELECT * FROM system.jobs WHERE id = $1",

We can also use SELECT id here.
pkg/sql/sql_activity_update_job_test.go
line 238 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
nit: perhaps just use backticks and wrap these statements a bit more? Also nit ... either capitalize SQL keywords (or don't), but try to be consistent (WHERE, LIKE, GROUP BY, etc)
Please capitalize keywords! It makes the SQL more readable. I think that there are still some instances of "like" (vs LIKE) in the queries here.
pkg/sql/sql_activity_update_job_test.go
line 178 at r7 (raw file):
}

_, err = db.ExecContext(ctx, "SET CLUSTER SETTING sql.stats.activity.top.max = 5")
Would it be worth testing that after changing this setting it does what's expected?
pkg/upgrade/upgrades/system_activity_update_job_test.go
line 66 at r7 (raw file):
)
row := db.QueryRow("select count(*) from system.public.jobs where id = 103")
nit: SELECT count(*) FROM system.public.jobs WHERE id = 103
}

type activityUpdaterMetrics struct {
	numErrors *metric.Counter
you must export numErrors (capitalize it) if you want those to show up as metrics. MetricStruct ignores any unexported metrics.
pkg/sql/sql_activity_update_job.go
Outdated
if jobs.HasErrJobCanceled(jobErr) {
	err := errors.NewAssertionErrorWithWrappedErrf(jobErr,
		"sql activity is not cancelable")
	log.Infof(ctx, "%v", err)
log.Error perhaps? Definitely an error...
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach, @dt, @maryliag, @miretskiy, @rharding6373, and @THardy98)
pkg/sql/sql_activity_update_job.go
line 46 at r6 (raw file):
Previously, maryliag (Marylia Gutierrez) wrote…
I think for this PR you can keep this frequency, so we unblock the GA blocker, but then more tests can be done and select a frequency using actual data as the base
I'm leaning to option1 from your list, I think the process can take more than 5 min, so could be risky, and the 3 could cause latency which is what we're trying to avoid.
I updated it to always execute once the node the job is on does a flush. It simplified the execution logic.
pkg/sql/sql_activity_update_job.go
line 67 at r6 (raw file):
Previously, maryliag (Marylia Gutierrez) wrote…
to the activity (some extra space comma and the)
Done.
pkg/sql/sql_activity_update_job.go
line 371 at r6 (raw file):
Previously, maryliag (Marylia Gutierrez) wrote…
you're missing the total execution time from your list here
Done.
pkg/sql/sql_activity_update_job.go
line 454 at r6 (raw file):
Previously, maryliag (Marylia Gutierrez) wrote…
you're missing the total execution time (that we use for the % of all runtime)
Done.
pkg/sql/sql_activity_update_job.go
line 218 at r7 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
Why 6? How do we know the 6 most popular columns?
There are too many columns to do all of them. We chose to only optimize for the most common columns that are sorted based on feedback and telemetry.
pkg/jobs/registry_test.go
line 475 at r7 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
We can also use
SELECT id
here.
Done.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach, @dt, @maryliag, @miretskiy, @rharding6373, and @THardy98)
pkg/sql/sql_activity_update_job.go
line 129 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
you must export numErrors (capitalize it) if you want those to show up as metrics. MetricStruct ignores any unexported metrics.
Done.
pkg/sql/sql_activity_update_job.go
line 154 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
log.Error perhaps? Definitely an error...
Done.
pkg/sql/sql_activity_update_job.go
line 237 at r7 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
Could you add an execbuilder test with all of these large queries (I count 4 of them)? This will make it easier to review, and also for the SQL Queries team to monitor plan changes that may affect performance over time. You could make a new file pkg/sql/opt/exec/execbuilder/testdata/observability containing these queries. These are a type of logic test, so the test formatting may be familiar (or you can look through the other tests in execbuilder/testdata for reference). Once the queries are in, you can generate the plans with dev test pkg/sql/opt/exec/execbuilder/tests/local -f=TestExecBuilder_observability --rewrite. Feel free to slack me if you run into trouble with this. Thanks!
Done.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach, @dt, @j82w, @maryliag, @miretskiy, and @THardy98)
pkg/sql/sql_activity_update_job.go
line 347 at r9 (raw file):
totalTxnClusterExecCount int64,
) (retErr error) {
	// Select the top 500 (controlled by sql.stats.activity.top.max) for
minor nit: I think changing the wording slightly to "Select the top 500...for each of ..." would add a bit of clarity. After listing all of them, it may also help to add "Up to 3000 rows (sql.stats.activity.top.max * 6) may be added to transaction_activity." to hammer that point home.
pkg/sql/sql_activity_update_job.go
line 433 at r9 (raw file):
// execution_count, total execution time, service_latency, cpu_sql_nanos,
// contention_time, p99_latency, and any statement that was in the
// top N transactions and insert into statement_activity table.
Does this mean that any row we upserted in the last query should also be upserted in the following query? If so, I don't fully understand how the following query is achieving that. Also if so, I wonder if we can utilize materialized views to dedupe some work.
Code quote:
// contention_time, p99_latency, and any statement that was in the
// top N transactions and insert into statement_activity table.
pkg/sql/opt/exec/execbuilder/testdata/observability
line 157 at r9 (raw file):
agg_interval,
max(metadata) AS metadata,
crdb_internal.merge_transaction_stats(array_agg(statistics)) AS statistics,
Should this be merge_statement_stats? Here and in the other statement activity upsert below.
pkg/sql/opt/exec/execbuilder/testdata/observability
line 487 at r9 (raw file):
spans: /2023-04-10T16:00:00Z-/2023-04-10T16:00:00.000000001Z

# Upsert statement_activity
s/statement/transaction
Previously, rharding6373 (Rachael Harding) wrote…
The issue with a materialized view is that it is part of the schema, which means the logic can only be changed on major versions. If we need to change it to only load the top 10, or increase it to the top 1000, it won't be possible. It would still require a job to run periodically to update the mv. There are 2 tables, one for statements and one for transactions. It first inserts the top N for the transactions. A transaction can have multiple statements. On the statement table we want the top 500, and we also want all the statements that were part of the top N transactions. That way, on the ui, if a user selects a transaction, all the statement information is available for that transaction.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach, @dt, @maryliag, @miretskiy, @rharding6373, and @THardy98)
pkg/sql/sql_activity_update_job.go
line 37 at r7 (raw file):
Previously, maryliag (Marylia Gutierrez) wrote…
curious that this is not giving a lint error that the
true
value should be on the next line
Done.
pkg/sql/opt/exec/execbuilder/testdata/observability
line 157 at r9 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
Should this be merge_statement_stats? Here and in the other statement activity upsert below.
Done.
pkg/sql/opt/exec/execbuilder/testdata/observability
line 487 at r9 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
s/statement/transaction
Done.
Previously, rharding6373 (Rachael Harding) wrote…
This was moved to TestSqlActivityUpdateTopLimitJob to avoid the test being too large.
Will this process continue to work if we change the aggregation interval of sql stats?
Reviewed 1 of 8 files at r1, 5 of 13 files at r9.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach, @dt, @j82w, @maryliag, @miretskiy, @rharding6373, and @THardy98)
pkg/sql/sql_activity_update_job.go
line 52 at r1 (raw file):
Previously, j82w (Jake) wrote…
I added a check.
This might become out of sync with the aggregation interval cluster setting used for system table writes -- is that okay?
pkg/sql/sql_activity_update_job.go
line 63 at r11 (raw file):
// sqlStatsActivityMaxPersistedRows specifies maximum number of rows that will be // retained in system.statement_statistics and system.transaction_statistics.
nit: should these be the activity tables?
pkg/sql/sql_activity_update_job.go
line 595 at r11 (raw file):
// ComputeAggregatedTs returns the aggregation timestamp to assign
// in-memory SQL stats during storage or aggregation.
func (u *SqlActivityUpdater) ComputeAggregatedTs(
nit: does this function need to be exported?
pkg/sql/sql_activity_update_job.go
line 605 at r11 (raw file):
// to or greater than the statistics table.
if interval < intervalSqlStats {
	log.Warningf(ctx, "sql.stats.activity.aggregation.interval: %s must be greater than or equal to sql.stats.aggregation.interval: %s", interval, intervalSqlStats)
Why do we have the additional sql.stats.activity.aggregation_interval?
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach, @dt, @maryliag, @miretskiy, @rharding6373, @THardy98, and @xinhaoz)
pkg/jobs/registry.go
line 714 at r7 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
Since we only want to know if a row exists, we can use SELECT id here. There's an index on id, so we won't need to fetch extra columns.
Done.
pkg/sql/sql_activity_update_job.go
line 52 at r1 (raw file):
Previously, xinhaoz (Xin Hao Zhang) wrote…
This might become out of sync with the aggregation interval cluster setting used for system table writes -- is that okay?
I figured it gives flexibility if we want the activity page to have a bigger aggregation interval than the system tables. I went ahead and removed it. We can always add it back later if necessary.
pkg/sql/sql_activity_update_job.go
line 63 at r11 (raw file):
Previously, xinhaoz (Xin Hao Zhang) wrote…
nit: should these be the activity tables? Why does
Fixed
pkg/sql/sql_activity_update_job.go
line 605 at r11 (raw file):
Previously, xinhaoz (Xin Hao Zhang) wrote…
Why do we have the additional sql.stats.activity.aggregation_interval?
I removed it.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach, @dt, @j82w, @maryliag, @miretskiy, @THardy98, and @xinhaoz)
pkg/sql/sql_activity_update_job.go
line 433 at r9 (raw file):
Previously, j82w (Jake) wrote…
The issue with a materialized view is that it is part of the schema, which means the logic can only be changed on major versions. If we need to change it to only load the top 10, or increase it to the top 1000, that won't be possible. It would also still require a job to run periodically to refresh the materialized view.
There are 2 tables, one for statements and one for transactions. The job first inserts the top N transactions. A transaction can have multiple statements. For the statement table we want the top 500, and we also want all the statements that were part of the top N transactions. That way, if a user selects a transaction in the UI, all of the statement information for that transaction is available.
Ok, thanks.
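The two-phase selection described in that exchange can be sketched as: keep the top-N transactions by execution count, then keep every statement whose transaction made the cut. The types and field names below are illustrative only, not CockroachDB's.

```go
package main

import (
	"fmt"
	"sort"
)

// txnStat and stmtStat are simplified stand-ins for the statistics rows.
type txnStat struct {
	fingerprintID string
	execCount     int
}

type stmtStat struct {
	fingerprintID    string
	txnFingerprintID string
	execCount        int
}

// topTxns keeps the top-N transactions by execution count.
func topTxns(txns []txnStat, n int) []txnStat {
	sort.Slice(txns, func(i, j int) bool { return txns[i].execCount > txns[j].execCount })
	if len(txns) > n {
		txns = txns[:n]
	}
	return txns
}

// stmtsForTopTxns keeps the statements belonging to a top-N transaction,
// so the UI can show full statement detail for any selected transaction.
func stmtsForTopTxns(stmts []stmtStat, top []txnStat) []stmtStat {
	keep := map[string]bool{}
	for _, t := range top {
		keep[t.fingerprintID] = true
	}
	var out []stmtStat
	for _, s := range stmts {
		if keep[s.txnFingerprintID] {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	txns := []txnStat{{"t1", 100}, {"t2", 5}, {"t3", 50}}
	stmts := []stmtStat{{"s1", "t1", 100}, {"s2", "t2", 5}, {"s3", "t3", 50}}
	top := topTxns(txns, 2) // keeps t1 and t3
	for _, s := range stmtsForTopTxns(stmts, top) {
		fmt.Println(s.fingerprintID) // prints s1 then s3
	}
}
```

In the real job this happens in SQL against the statistics tables (and the statement side also keeps its own top 500), but the shape of the filter is the same.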
pkg/sql/opt/exec/execbuilder/testdata/observability
line 157 at r9 (raw file):
Previously, j82w (Jake) wrote…
Done.
I still see we're using crdb_internal.merge_transaction_stats here, although statistics comes from the statement_statistics table. Is this expected? It looks like this might have been fixed in pkg/sql/sql_activity_update_job.go.
Previously, rharding6373 (Rachael Harding) wrote…
Fixed, I just missed this one. I double checked the others. |
bors r+ |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.
error creating merge commit from 53afe66 to blathers/backport-release-23.1-100807: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 23.1.x failed. See errors above.
error creating merge commit from 53afe66 to blathers/backport-release-23.1.0-100807: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 23.1.0 failed. See errors above.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
The system statistics tables grow too large for the
UI to quickly return results. The new activity
tables job does the aggregation and only records
the top 500 rows for each of the 6 columns. This means
for a given hour there is a limit of 3,000 rows.
This allows the UI to return results quickly and reliably.
If the job detects there are fewer than 3,000
rows, it will simply copy all the rows to
the activity tables.
Epic: none
closes: #98882
Release note (sql change): Adds a new sql activity
updater job. The job updates the system
transaction_activity and statement_activity
tables based on the statistics tables.