Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sqlstats: add limit to flush data #97123

Merged
merged 1 commit into from
Feb 16, 2023
Merged

Conversation

maryliag
Copy link
Contributor

Previously, we would always let data get flushed to system tables, then the cleanup job would remove the excess data.
When the cleanup job have hiccups and get stuck data was still continuously being added, making the situation worst and requiring reset of stats in some cases.

To prevent this cases, this commit is adding a limit of how much excess data can be flushed, this way if the cleanup job stops working, the data won't blow up.

Part Of #97074

Follow up work can do a better job at cleaning the data during the flush, but this commit focus on adding the safety so it can be backported.

Release note (sql change): Add a hard limit of how much data can be flushed to system tables for sql stats.

@maryliag maryliag requested a review from a team February 14, 2023 16:49
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@maryliag maryliag force-pushed the max-flush branch 2 times, most recently from 077b002 to 3d99f14 Compare February 14, 2023 19:06
@j82w
Copy link
Contributor

j82w commented Feb 14, 2023

-- commits line 13 at r1:
What is the concern with the data blowing up? Just wondering if the persisted table growing is better than losing the stats data.

Copy link
Contributor Author

@maryliag maryliag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w)


-- commits line 13 at r1:

Previously, j82w (Jake) wrote…

What is the concern with the data blowing up? Just wondering if the persisted table growing is better than losing the stats data.

the problem is that we saw the case of getting 4Tb of data, and if the user wants the limit of 1M, that is a lot more.
After that we can work on improvements such as check the status of cleanup job to give warning or combining with flush, and this would no longer be an issue. But at least is a safety for now

@j82w
Copy link
Contributor

j82w commented Feb 14, 2023

-- commits line 13 at r1:

Previously, maryliag (Marylia Gutierrez) wrote…

the problem is that we saw the case of getting 4Tb of data, and if the user wants the limit of 1M, that is a lot more.
After that we can work on improvements such as check the status of cleanup job to give warning or combining with flush, and this would no longer be an issue. But at least is a safety for now

Even with 1M the UI will still fail to load right? Just wondering if we should focus on reducing the response size limit or do some change so that the UI always loads.

@xinhaoz
Copy link
Member

xinhaoz commented Feb 14, 2023

pkg/sql/sqlstats/persistedsqlstats/flush.go line 49 at r1 (raw file):

	shouldWipeInMemoryStats = shouldWipeInMemoryStats || (!enabled && allowDiscardWhenDisabled)

	if shouldWipeInMemoryStats {

Do we need to change this behaviour at all? What did we do for in-memory stats prior to persisted stats? It would be nice if it behaved like a FIFO cache so that we could hold on to the in-memory stats between flush attempts.

Copy link
Contributor Author

@maryliag maryliag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w and @xinhaoz)


-- commits line 13 at r1:

Previously, j82w (Jake) wrote…

Even with 1M the UI will still fail to load right? Just wondering if we should focus on reducing the response size limit or do some change so that the UI always loads.

Yes, it does fail to load, but that is not what this PR is trying to address. That would require more significant changes. Here I'm just focusing on the case where the cleanup gets stuck, fails, and I'm adding the safety to avoid data blowing up.


pkg/sql/sqlstats/persistedsqlstats/flush.go line 49 at r1 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

Do we need to change this behaviour at all? What did we do for in-memory stats prior to persisted stats? It would be nice if it behaved like a FIFO cache so that we could hold on to the in-memory stats between flush attempts.

I'm still deciding if I should change this or not. I think I might keep as is for this PR, to focus on the safety check, and further improvements can look into this better. Specially if we want to change to do a cleanup along with flush, this behaviour will certainly change.

Previously, we would always let data get flushed to
system tables, then the cleanup job would remove the
excess data.
When the cleanup job have hiccups and get stuck data
was still continuously being added, making the situation
worst and requiring reset of stats in some cases.

To prevent this cases, this commit is adding a limit of
how much excess data can be flushed, this way if the cleanup
job stops working, the data won't blow up.

Part Of cockroachdb#97074

Follow up work can do a better job at cleaning the data
during the flush, but this commit focus on adding the safety
so it can be backported.

Release note (sql change): Add a hard limit of how much data
can be flushed to system tables for sql stats.
Copy link
Contributor

@ericharmeling ericharmeling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as a safety check! I had a minor question/suggestion, but that could also wait for the follow-up PR.

Reviewed 2 of 3 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w, @maryliag, and @xinhaoz)


pkg/sql/sqlstats/persistedsqlstats/flush.go line 104 at r2 (raw file):

	}
	actualSize := float64(tree.MustBeDInt(row[0]))
	return actualSize > (maxPersistedRows * 1.5)

The 1.5 multiplier seems a bit arbitrary. Would it make more sense to use sql.metrics.max_mem_stmt_fingerprints (i.e., up to one extra flush)?

Code quote:

1.5

Copy link
Contributor Author

@maryliag maryliag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling, @j82w, and @xinhaoz)


pkg/sql/sqlstats/persistedsqlstats/flush.go line 104 at r2 (raw file):

Previously, ericharmeling (Eric Harmeling) wrote…

The 1.5 multiplier seems a bit arbitrary. Would it make more sense to use sql.metrics.max_mem_stmt_fingerprints (i.e., up to one extra flush)?

My thinking for this number was:
with the default for memory 100k and default for persisted 1M, imagine that during each flush we do send 100k, and we do want to keep sending those until we reach the next hourly cleanup, so when you reach the 50min you would have added 500k extra, then on the hour you're about the clean, so I stop introducing more data.
Allowing just the max for the memory as extra is more conservative, but I wanted to give the extra room for data, but still not letting blowing up. I think it would be okay so let the data reach 1.5 of its max, knowing it will get cleanup soon.
We can always change this value if we receive any feedback on it off course.

@maryliag
Copy link
Contributor Author

TFTR!
bors r+

Copy link
Contributor

@j82w j82w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling, @maryliag, and @xinhaoz)


pkg/sql/sqlstats/persistedsqlstats/flush.go line 85 at r2 (raw file):

	maxPersistedRows := float64(SQLStatsMaxPersistedRows.Get(&s.SQLStats.GetClusterSettings().SV))

	readStmt := `

Do we need to do any perf testing on this? I've noticed sometimes querying this table can take several seconds to over a minute. Would it be more efficient to get a descriptor and get a rough disk size or the table stats or directly query the time series db?

@craig
Copy link
Contributor

craig bot commented Feb 15, 2023

Build failed:

@maryliag
Copy link
Contributor Author

bors r-

@maryliag maryliag added the O-postmortem Originated from a Postmortem action item. label Feb 15, 2023
@maryliag
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Feb 16, 2023

Build succeeded:

@craig craig bot merged commit 91daf41 into cockroachdb:master Feb 16, 2023
@blathers-crl
Copy link

blathers-crl bot commented Feb 16, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 5f357ba to blathers/backport-release-22.1-97123: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

maryliag added a commit to maryliag/cockroach that referenced this pull request Feb 24, 2023
During the sql stats compaction job, we limit the amount of
rows being deleted per transaction. We used a default value
of 1024, but we have been increasinly seeing customer needing
to increase this value to allow the job to keep up with the
large amount of data being flushed.
We have been recommening a value for 20k, so being more
conservative with the default (plus the changes on cockroachdb#97123
that won't let tables get in a state with so many rows),
this commit changes the value to 10k.

Fixes cockroachdb#97528

Release note (sql change): Increase the default value of
`sql.stats.cleanup.rows_to_delete_per_txn` to 10k, to increase
efficiency of the cleanup job for sql statistics.
craig bot pushed a commit that referenced this pull request Feb 27, 2023
97486: ui: fix jobs page error state r=xinhaoz a=xinhaoz

Previously, the error state for the jobs page was
only displayed if there were previously jobs
returned. We should show the api error even when
we have never received a successful jobs payload
(e.g. error on first request). This commit changes 
error displaying in the jobs page  such that we will 
show the request error  regardless of whether or 
not we have previously received data. If we have 
previous data when we receive an unsuccessful request 
response, we will show the error along with the existing data.

Epic: none

Release note (bug fix): Jobs page properly shows error state when we receive an error during data fetching.

97642: sqlstats: increase default value for deleted rows r=maryliag a=maryliag

During the sql stats compaction job, we limit the amount of rows being deleted per transaction. We used a default value of 1024, but we have been increasinly seeing customer needing to increase this value to allow the job to keep up with the large amount of data being flushed.
We have been recommening a value for 20k, so being more conservative with the default (plus the changes on #97123 that won't let tables get in a state with so many rows), this commit changes the value to 10k.

Fixes #97528

Release note (sql change): Increase the default value of `sql.stats.cleanup.rows_to_delete_per_txn` to 10k, to increase efficiency of the cleanup job for sql statistics.

97662: ui: show recent executions even without lock information r=maryliag a=maryliag

Previously, if the call to retrieve contention/lock information returned a max size limit error, we were not displaying any data on the recent executions.
Now we show the data returned with a warning that cluster locks information might be missing.

Part Of #96184

<img width="1133" alt="Screenshot 2023-02-24 at 6 25 39 PM" src="https://user-images.githubusercontent.com/1017486/221320673-65a8f2eb-e7e3-4f27-b911-4111e8dc6728.png">


Release note (ui change): Still show active execution information when there is a max size limit error.

Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: maryliag <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Feb 27, 2023
During the sql stats compaction job, we limit the amount of
rows being deleted per transaction. We used a default value
of 1024, but we have been increasinly seeing customer needing
to increase this value to allow the job to keep up with the
large amount of data being flushed.
We have been recommening a value for 20k, so being more
conservative with the default (plus the changes on #97123
that won't let tables get in a state with so many rows),
this commit changes the value to 10k.

Fixes #97528

Release note (sql change): Increase the default value of
`sql.stats.cleanup.rows_to_delete_per_txn` to 10k, to increase
efficiency of the cleanup job for sql statistics.
blathers-crl bot pushed a commit that referenced this pull request Feb 27, 2023
During the sql stats compaction job, we limit the amount of
rows being deleted per transaction. We used a default value
of 1024, but we have been increasinly seeing customer needing
to increase this value to allow the job to keep up with the
large amount of data being flushed.
We have been recommening a value for 20k, so being more
conservative with the default (plus the changes on #97123
that won't let tables get in a state with so many rows),
this commit changes the value to 10k.

Fixes #97528

Release note (sql change): Increase the default value of
`sql.stats.cleanup.rows_to_delete_per_txn` to 10k, to increase
efficiency of the cleanup job for sql statistics.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
O-postmortem Originated from a Postmortem action item.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants