
sql: remove diagnostics.sql_stat_reset.interval #71597

Merged
merged 1 commit into cockroachdb:master from sqlstats-nuke-telemetry-dump
Nov 10, 2021

Conversation

Azhng
Contributor

@Azhng Azhng commented Oct 15, 2021

Previously, the SQL Stats subsystem reset its in-memory SQL Stats
periodically (determined by the diagnostics.sql_stat_reset.interval
cluster setting). This periodic reset loop was introduced early
in 21.1, prior to persisted SQL Stats. In 21.2, we introduced
persisted SQL Stats along with a periodic task that flushes
in-memory SQL Stats into system tables. The coexistence of these
two async tasks introduced unwanted race conditions: if the SQL
Stats reset task reset the in-memory store before the flusher
could flush it to the system table, those stats would be lost
forever.
This commit changes the .Start() method in PersistedSQLStats to
avoid this race condition. However, reported SQL Stats still
periodically resets its in-memory storage in case the telemetry
server goes offline.

Release note (sql change): the diagnostics.sql_stat_reset.interval
cluster setting is removed.
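
A minimal sketch of the shape of the fix, for illustration only: this is not the actual PersistedSQLStats code, and names such as flushAndReset and flushInterval are made up. The point is that the flush and the reset happen as a single step on a single task, so a reset can never discard stats that have not yet been flushed:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sqlStats is a stand-in for the in-memory SQL Stats container.
type sqlStats struct {
	inMemory []string
}

// flushAndReset persists the in-memory stats and clears them in one step.
// Because the reset only ever happens here, immediately after the flush,
// there is no window in which a separate reset task could discard
// unflushed stats.
func (s *sqlStats) flushAndReset() {
	for _, stat := range s.inMemory {
		fmt.Println("flushing to system table:", stat) // stand-in for the SQL write
	}
	s.inMemory = s.inMemory[:0]
}

// Start runs the combined flush-and-reset loop on a single goroutine,
// replacing the two racing tasks described above.
func (s *sqlStats) Start(ctx context.Context, flushInterval time.Duration) {
	go func() {
		ticker := time.NewTicker(flushInterval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				s.flushAndReset()
			case <-ctx.Done():
				return
			}
		}
	}()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	s := &sqlStats{inMemory: []string{"SELECT 1", "INSERT INTO t VALUES (1)"}}
	s.Start(ctx, 100*time.Millisecond)
	time.Sleep(300 * time.Millisecond) // let the loop run a few times
}
```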

@cockroach-teamcity
Member

This change is Reviewable

@Azhng
Contributor Author

Azhng commented Oct 15, 2021

Follow up to #71596

@Azhng Azhng force-pushed the sqlstats-nuke-telemetry-dump branch from 34b5b1d to 7d5d4f3 on October 15, 2021 03:47
@Azhng Azhng requested a review from a team October 15, 2021 03:59
@Azhng Azhng marked this pull request as ready for review October 15, 2021 03:59
@Azhng Azhng requested a review from a team October 15, 2021 03:59
@Azhng
Contributor Author

Azhng commented Oct 15, 2021

Tagging @cockroachdb/sql-experience to review whether I retired the cluster setting properly.

@Azhng Azhng force-pushed the sqlstats-nuke-telemetry-dump branch from 7d5d4f3 to d337fdd on October 15, 2021 04:22
Contributor

@matthewtodd matthewtodd left a comment


Are there any tests that would be useful to add here? I'm imagining something that would show the interaction of the happy-path flush to system tables with the uh-oh path reset because it's been too long and we're running out of memory. WDYT?

Reviewed 9 of 10 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng and @matthewtodd)


-- commits, line 17 at r1:
nit: some style / grammar adjustments, WDYT?

Previously, the SQL Stats system reset its in-memory SQL Stats periodically,
determined by the diagnostics.sql_stat_reset.interval cluster setting. This
periodic reset loop was introduced early in 21.1 prior to persisted SQL Stats.
In 21.2, we introduced persisted SQL Stats along with a periodic task to
flush in-memory SQL Stats into system tables. These two async tasks'
coexistence introduced unwanted race conditions. If the SQL Stats reset task
reset the in-memory store before the flusher could flush it to the system
table, those particular stats would be lost forever.

This commit changes the .Start() method in PersistedSQLStats to avoid this
race condition. However, reported SQL Stats still periodically resets its
in-memory storage in case the telemetry server goes offline.


pkg/sql/conn_executor.go, line 447 at r1 (raw file):

	s.sqlStats.Start(ctx, stopper)

	// reportedStats is periodically cleared to prevent too much SQL Stats

nit: s/much/many/

Contributor

@maryliag maryliag left a comment


We also need tests that are able to capture the initial issue this PR is trying to fix; otherwise it's just a guess that this change would indeed fix the issue.

Reviewed 8 of 10 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng)

@Azhng
Contributor Author

Azhng commented Oct 19, 2021

Hmm, testing this would be really difficult, because the solution to the original problematic component is to completely remove it. That means any "new" tests we introduce would be testing against a component that no longer exists. The old component also lacks testing knobs for us to properly test this logic, so in order to introduce a test for the old component, we would need to:

  • commit 1: introduce testing knobs to the old component (which is slated for deletion)
  • commit 2: introduce a test that demonstrates the failure
  • commit 3: introduce the fix, but then remove the test, since it depends on the testing knobs and the old component

The net effect would be the same as in this PR.


On a different subject, I think there's a decent chance that this PR might not address the root issue that caused #71596. I will test my new theory and send in a new PR for that.

Regardless, we should merge this PR, because we don't want two different goroutines racing to clear our in-memory data.

@Azhng
Contributor Author

Azhng commented Oct 19, 2021

Also, the telemetry dumping unit tests were updated to reflect this change.

@Azhng Azhng requested review from a team and removed request for a team November 9, 2021 16:13
Contributor

@knz knz left a comment


Regarding testing, there are two separate stories:

  • When I look at this PR without the context of the issue that motivates it, I still see a compelling reason to merge the PR: the periodic flush has become redundant with the new mechanism to persist stats, and it's good to remove redundant code (for hygiene). When removing an obsolete mechanism, we do not typically introduce new tests that confirm that the mechanism has been removed. Instead, we remove the tests that had been added to support the original mechanism.

    (Btw, I am not recognizing this: where were the unit tests for diagnostics.sql_stat_reset.interval located? These should at least be altered, if not removed outright, by this PR)

  • Generally, if there's a race condition somewhere, it is due to the interaction between two or more subsystems. If we keep both subsystems and then add some coordination mechanism (e.g. a mutex; see the sketch after this list), then yes we do want to add a new test for that race condition, to assert that it does not occur and prevent regressions.

    However, in this case, we are removing one of the participants in the race condition such that only one subsystem remains. After this PR is merged, there is no conflict between two mechanisms any more - no environment where a race condition could even occur. In that case (all concurrent mechanisms, but one, have been removed) we also don't typically add new tests.
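
As an illustration of that alternative (a minimal sketch with made-up names, not CockroachDB code): keeping both tasks would mean coordinating them, and to avoid losing stats the reset path would have to flush first, under the same lock:

```go
package main

import (
	"fmt"
	"sync"
)

// guardedStats keeps both a flush path and a reset path alive and
// coordinates them with a mutex, i.e. the alternative described above.
type guardedStats struct {
	mu       sync.Mutex
	inMemory []string
}

// flushLocked persists and clears the stats; callers must hold mu.
func (g *guardedStats) flushLocked() {
	for _, stat := range g.inMemory {
		fmt.Println("flushing:", stat) // stand-in for the SQL write
	}
	g.inMemory = g.inMemory[:0]
}

// flush is the periodic flush task's entry point.
func (g *guardedStats) flush() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.flushLocked()
}

// reset is the periodic reset task's entry point. It cannot simply clear
// the slice: to avoid losing stats it has to flush first.
func (g *guardedStats) reset() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.flushLocked() // without this, unflushed stats would be lost forever
}

func main() {
	g := &guardedStats{inMemory: []string{"SELECT 1"}}
	g.reset() // flushes "SELECT 1" instead of silently dropping it
	g.flush() // nothing left to flush
}
```

Note how, once the reset path has to flush before clearing, it degenerates into the flush path itself, which is another way of seeing why removing the separate reset task, as this PR does, is the cleaner fix.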

Reviewed 10 of 10 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng)


-- commits, line 19 at r1:
More words here please.

@Azhng Azhng force-pushed the sqlstats-nuke-telemetry-dump branch from d337fdd to 1e26f2c on November 9, 2021 17:22
Previously, the SQL Stats system reset its in-memory SQL Stats periodically,
determined by the diagnostics.sql_stat_reset.interval cluster setting. This
periodic reset loop was introduced early in 21.1, prior to persisted SQL Stats.
In 21.2, we introduced persisted SQL Stats along with a periodic task to
flush in-memory SQL Stats into system tables. These two async tasks'
coexistence introduced unwanted race conditions. If the SQL Stats reset task
reset the in-memory store before the flusher could flush it to the system
table, those particular stats would be lost forever.

This commit changes the .Start() method in PersistedSQLStats to avoid this
race condition. However, reported SQL Stats still periodically resets its
in-memory storage in case the telemetry server goes offline. As a result,
the diagnostics.sql_stat_reset.interval cluster setting is removed. The
diagnostics.forced_sql_stat_reset.interval cluster setting now only
controls the reset of the *reported* SQL Stats if they are not collected
by the telemetry reporter.

Release note (sql change): Previously, in-memory SQL Stats were reset
periodically. This behavior has been removed, since persisted SQL Stats
was introduced in 21.2. As a result, the diagnostics.sql_stat_reset.interval
cluster setting is removed. diagnostics.forced_sql_stat_reset.interval
now only controls the reset of the reported SQL Stats if they are not
collected by the telemetry reporter.
@Azhng Azhng force-pushed the sqlstats-nuke-telemetry-dump branch from 1e26f2c to af1866e on November 9, 2021 17:32
Contributor Author

@Azhng Azhng left a comment


I had a difficult time trying to find where we originally tested the behaviour of the diagnostics.sql_stat_reset.interval cluster setting, but I couldn't find anything. Perhaps that is why this issue was not caught early on.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @maryliag, and @matthewtodd)


-- commits, line 17 at r1:

Previously, matthewtodd (Matthew Todd) wrote…

nit: some style / grammar adjustments, WDYT?

Previously, the SQL Stats system reset its in-memory SQL Stats periodically,
determined by the diagnostics.sql_stat_reset.interval cluster setting. This
periodic reset loop was introduced early in 21.1 prior to persisted SQL Stats.
In 21.2, we introduced persisted SQL Stats along with a periodic task to
flush in-memory SQL Stats into system tables. These two async tasks'
coexistence introduced unwanted race conditions. If the SQL Stats reset task
reset the in-memory store before the flusher could flush it to the system
table, those particular stats would be lost forever.

This commit changes the .Start() method in PersistedSQLStats to avoid this
race condition. However, reported SQL Stats still periodically resets its
in-memory storage in case the telemetry server goes offline.

Done.


-- commits, line 19 at r1:

Previously, knz (kena) wrote…

More words here please.

Added more description to the release note.


pkg/sql/conn_executor.go, line 447 at r1 (raw file):

Previously, matthewtodd (Matthew Todd) wrote…

nit: s/much/many/

Done.

Contributor

@knz knz left a comment


Reviewed 7 of 7 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng)

Contributor

@maryliag maryliag left a comment


:lgtm:

Reviewed 7 of 7 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @Azhng)

@Azhng
Contributor Author

Azhng commented Nov 9, 2021

TFTR!

bors r=knz,maryliag

@craig
Contributor

craig bot commented Nov 9, 2021

Build failed (retrying...):

@craig
Contributor

craig bot commented Nov 9, 2021

Build failed (retrying...):

@craig
Contributor

craig bot commented Nov 10, 2021

Build succeeded:

@craig craig bot merged commit fcb5221 into cockroachdb:master Nov 10, 2021
@blathers-crl

blathers-crl bot commented Nov 10, 2021

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from af1866e to blathers/backport-release-21.2-71597: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@knz
Contributor

knz commented Nov 10, 2021

should this be blindly backported to 21.2? Was the code for stats persistence already in 21.2?

@Azhng
Contributor Author

Azhng commented Nov 19, 2021

blathers backport 21.2

@blathers-crl

blathers-crl bot commented Nov 19, 2021

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating backport branch refs/heads/blathers/backport-release-21.2-71597: POST https://api.github.com/repos/cockroachdb/cockroach/git/refs: 422 Reference already exists []

Backport to branch 21.2 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
