sqlstats: add limit to flush data #97123
Conversation
077b002 to 3d99f14 (Compare)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @j82w)
Previously, j82w (Jake) wrote…
What is the concern with the data blowing up? Just wondering if the persisted table growing is better than losing the stats data.
The problem is that we saw a case of 4 TB of data, and if the user wants a limit of 1M, that is a lot more.
After that we can work on improvements such as checking the status of the cleanup job to give a warning, or combining it with the flush, and this would no longer be an issue. But at least it is a safety for now.
Previously, maryliag (Marylia Gutierrez) wrote…
Even with 1M the UI will still fail to load, right? Just wondering if we should focus on reducing the response size limit or do some change so that the UI always loads.
Do we need to change this behaviour at all? What did we do for in-memory stats prior to persisted stats? It would be nice if it behaved like a FIFO cache so that we could hold on to the in-memory stats between flush attempts.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @j82w and @xinhaoz)
Previously, j82w (Jake) wrote…
Even with 1M the UI will still fail to load, right? Just wondering if we should focus on reducing the response size limit or do some change so that the UI always loads.
Yes, it does fail to load, but that is not what this PR is trying to address; that would require more significant changes. Here I'm just focusing on the case where the cleanup gets stuck or fails, and I'm adding a safety to avoid the data blowing up.
pkg/sql/sqlstats/persistedsqlstats/flush.go
line 49 at r1 (raw file):
Previously, xinhaoz (Xin Hao Zhang) wrote…
Do we need to change this behaviour at all? What did we do for in-memory stats prior to persisted stats? It would be nice if it behaved like a FIFO cache so that we could hold on to the in-memory stats between flush attempts.
I'm still deciding if I should change this or not. I think I might keep it as is for this PR, to focus on the safety check; further improvements can look into this better. Especially if we want to change to do a cleanup along with the flush, this behaviour will certainly change.
LGTM as a safety check! I had a minor question/suggestion, but that could also wait for the follow-up PR.
Reviewed 2 of 3 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @j82w, @maryliag, and @xinhaoz)
pkg/sql/sqlstats/persistedsqlstats/flush.go
line 104 at r2 (raw file):
}
actualSize := float64(tree.MustBeDInt(row[0]))
return actualSize > (maxPersistedRows * 1.5)
The `1.5` multiplier seems a bit arbitrary. Would it make more sense to use `sql.metrics.max_mem_stmt_fingerprints` (i.e., up to one extra flush)?
Code quote: `1.5`
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling, @j82w, and @xinhaoz)
pkg/sql/sqlstats/persistedsqlstats/flush.go
line 104 at r2 (raw file):
Previously, ericharmeling (Eric Harmeling) wrote…
The `1.5` multiplier seems a bit arbitrary. Would it make more sense to use `sql.metrics.max_mem_stmt_fingerprints` (i.e., up to one extra flush)?
My thinking for this number was:
With the default for memory of 100k and the default for persisted of 1M, imagine that during each flush we send 100k, and we want to keep sending those until we reach the next hourly cleanup. By the 50-minute mark you would have added 500k extra rows, and on the hour you're about to clean, so I stop introducing more data at that point.
Allowing just the memory max as extra is more conservative, but I wanted to give some extra room for data while still not letting it blow up. I think it is okay to let the data reach 1.5x of its max, knowing it will get cleaned up soon.
We can always change this value if we receive any feedback on it, of course.
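To make the arithmetic above concrete, here is a minimal, self-contained sketch of the threshold check being discussed. The standalone function and its name are illustrative; per the code quote above, the real flush.go derives the row count from `tree.MustBeDInt(row[0])` and the maximum from `SQLStatsMaxPersistedRows`.

```go
package main

import "fmt"

// limitReached mirrors the check quoted above: stop flushing once the
// persisted table holds more than 1.5x the configured maximum. With the
// defaults discussed (100k in-memory, 1M persisted, hourly cleanup), five
// extra flushes of 100k rows add 500k on top of 1M, i.e. exactly 1.5x.
func limitReached(actualRowCount, maxPersistedRows float64) bool {
	return actualRowCount > maxPersistedRows*1.5
}

func main() {
	const maxPersistedRows = 1_000_000
	fmt.Println(limitReached(1_400_000, maxPersistedRows)) // false: keep flushing
	fmt.Println(limitReached(1_600_000, maxPersistedRows)) // true: stop flushing
}
```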
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling, @maryliag, and @xinhaoz)
pkg/sql/sqlstats/persistedsqlstats/flush.go
line 85 at r2 (raw file):
maxPersistedRows := float64(SQLStatsMaxPersistedRows.Get(&s.SQLStats.GetClusterSettings().SV))
readStmt := `
Do we need to do any perf testing on this? I've noticed that sometimes querying this table can take several seconds to over a minute. Would it be more efficient to get a descriptor and use a rough disk size, use the table stats, or directly query the time series DB?
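One possible way to keep that size check cheap, sketched below under the assumption that counting only up to the threshold is acceptable. This is not necessarily what the PR implements; the query shape, placeholder, and threshold handling are illustrative.

```go
package main

import "fmt"

// boundedCountQuery counts rows only up to a caller-supplied limit, so the
// scan can stop early once the threshold is exceeded instead of reading
// every row of system.statement_statistics. This is a sketch, not the PR's
// actual readStmt.
const boundedCountQuery = `
SELECT count(*)
FROM (
  SELECT fingerprint_id
  FROM system.statement_statistics
  LIMIT $1
) AS capped`

func main() {
	// A caller would pass maxPersistedRows*1.5 + 1 as $1 and treat a count
	// equal to that limit as "threshold reached".
	fmt.Println(boundedCountQuery)
}
```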
Build failed:
bors r-
bors r+
Build succeeded:
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.
error creating merge commit from 5f357ba to blathers/backport-release-22.1-97123: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []; you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 22.1.x failed. See errors above.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
During the sql stats compaction job, we limit the number of rows being deleted per transaction. We used a default value of 1024, but we have increasingly been seeing customers needing to raise this value to allow the job to keep up with the large amount of data being flushed. We have been recommending a value of 20k, so, being more conservative with the default (plus the changes in cockroachdb#97123 that won't let tables get into a state with so many rows), this commit changes the value to 10k.

Fixes cockroachdb#97528

Release note (sql change): Increase the default value of `sql.stats.cleanup.rows_to_delete_per_txn` to 10k, to increase the efficiency of the cleanup job for sql statistics.
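For operators who hit the old 1024-row default before picking up this change, the setting can also be raised manually. Below is a minimal sketch using Go's standard database/sql over the Postgres wire protocol; the driver choice, connection string, and value are assumptions for illustration only.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // pgwire driver; CockroachDB speaks the Postgres wire protocol
)

func main() {
	// Connection string is illustrative; adjust for your cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Raise the per-transaction delete batch used by the sql stats cleanup
	// job; 10000 matches the new default described in the commit message.
	if _, err := db.Exec(
		"SET CLUSTER SETTING sql.stats.cleanup.rows_to_delete_per_txn = 10000",
	); err != nil {
		log.Fatal(err)
	}
	log.Println("setting updated")
}
```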
97486: ui: fix jobs page error state r=xinhaoz a=xinhaoz

Previously, the error state for the jobs page was only displayed if jobs had previously been returned. We should show the API error even when we have never received a successful jobs payload (e.g., an error on the first request). This commit changes error displaying on the jobs page so that we show the request error regardless of whether or not we have previously received data. If we have previous data when we receive an unsuccessful request response, we show the error along with the existing data.

Epic: none

Release note (bug fix): Jobs page properly shows the error state when we receive an error during data fetching.

97642: sqlstats: increase default value for deleted rows r=maryliag a=maryliag

During the sql stats compaction job, we limit the number of rows being deleted per transaction. We used a default value of 1024, but we have increasingly been seeing customers needing to raise this value to allow the job to keep up with the large amount of data being flushed. We have been recommending a value of 20k, so, being more conservative with the default (plus the changes in #97123 that won't let tables get into a state with so many rows), this commit changes the value to 10k.

Fixes #97528

Release note (sql change): Increase the default value of `sql.stats.cleanup.rows_to_delete_per_txn` to 10k, to increase the efficiency of the cleanup job for sql statistics.

97662: ui: show recent executions even without lock information r=maryliag a=maryliag

Previously, if the call to retrieve contention/lock information returned a max size limit error, we were not displaying any data on the recent executions. Now we show the data returned, with a warning that cluster locks information might be missing.

Part Of #96184

![Screenshot 2023-02-24 at 6 25 39 PM](https://user-images.githubusercontent.com/1017486/221320673-65a8f2eb-e7e3-4f27-b911-4111e8dc6728.png)

Release note (ui change): Still show active execution information when there is a max size limit error.

Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: maryliag <[email protected]>
Previously, we would always let data get flushed to the system tables, and then the cleanup job would remove the excess data.
When the cleanup job had hiccups and got stuck, data was still continuously being added, making the situation worse and requiring a reset of the stats in some cases.
To prevent these cases, this commit adds a limit on how much excess data can be flushed; this way, if the cleanup job stops working, the data won't blow up.
Part Of #97074
Follow-up work can do a better job of cleaning the data during the flush, but this commit focuses on adding the safety so it can be backported.
Release note (sql change): Add a hard limit on how much data can be flushed to system tables for sql stats.