Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: scheduled logger to capture index usage stats #76886

Conversation

THardy98
Copy link

@THardy98 THardy98 commented Feb 22, 2022

Resolves: #72486

This change introduces a scheduled logger that captures index usage
statistics logs on a time interval.

Release note (sql change): Initial implementation of a scheduled logger
used to capture index usage statistics to the telemetry logging channel.

Release justification: Category 4: Low risk, high benefit changes to
existing functionality.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@THardy98 THardy98 marked this pull request as draft February 22, 2022 14:17
@THardy98 THardy98 force-pushed the scheduled_logging_capture_index_usage_stats branch 4 times, most recently from 48fccc1 to 60d2799 Compare February 22, 2022 16:03
@THardy98 THardy98 requested review from Azhng and a team February 22, 2022 16:22
@THardy98 THardy98 force-pushed the scheduled_logging_capture_index_usage_stats branch 5 times, most recently from 7efa752 to 968c579 Compare February 23, 2022 15:19
Copy link
Contributor

@jocrl jocrl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm asking two questions but they're not blockers to approval, so don't worry about them if you're busy and trying to get things in 🙂.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng and @THardy98)


pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 146 at r1 (raw file):

// logIndexUsageStatsWithDelay logs a slice of eventpb.EventPayload at half
// second intervals (2 logs per second) to avoid exceeding the 10 log-line per
// second limit per node on the telemetry logging pipeline.

Just curious, is this 10 log-line limit cumulative across other things that might also be logging in the telemetry pipeline?

Code quote:

// second intervals (2 logs per second) to avoid exceeding the 10 log-line per
// second limit per node on the telemetry logging pipeline.

pkg/util/log/eventpb/telemetry.proto, line 53 at r1 (raw file):

  CommonEventDetails common = 1 [(gogoproto.nullable) = false, (gogoproto.jsontag) = "", (gogoproto.embed) = true];

  // Couldn't use roachpb.CollectedIndexUsageStatistics due to circular dependency.

Just out of curiosity, why is the solution here not to pull something out in a way that removes the circular dependency (perhaps as a long-term fix if you're trying to get this in)? Are the module organizations too complicated/don't lend to reorganization/we're usually fine was just duplicating things?

Copy link
Contributor

@jocrl jocrl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and to add tests before un-drafting.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng and @THardy98)

@THardy98 THardy98 force-pushed the scheduled_logging_capture_index_usage_stats branch from 968c579 to 697727f Compare February 23, 2022 21:21
Copy link
Contributor

@matthewtodd matthewtodd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 10 files at r1, 1 of 3 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @jocrl, and @THardy98)


pkg/server/server_sql.go, line 1249 at r3 (raw file):

	)

	scheduledlogging.StartCapturingIndexUsageStats(ctx, stopper, s.execCfg.DB, s.execCfg.Settings, s.internalExecutor)

Maybe just call this Start? Since we're dealing with a global limit on the telemetry pipeline, I imagine any future scheduled logging would happen inside the same async task?


pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 199 at r3 (raw file):

// second intervals (2 logs per second) to avoid exceeding the 10 log-line per
// second limit per node on the telemetry logging pipeline.
func logIndexUsageStatsWithDelay(

Also, I notice that this function doesn't care about the EventPayload contents. Perhaps there's a natural seam here where we've got one thing selecting what we want to log (just index usage stats for now) and another thing that slowly writes them into the logging pipeline. I see now that decoupling them means you need some kind of buffering, which gets more complicated, so maybe that's more than we need to think about for now?


pkg/util/log/eventpb/telemetry.proto, line 53 at r1 (raw file):

Previously, jocrl (Josephine) wrote…

Just out of curiosity, why is the solution here not to pull something out in a way that removes the circular dependency (perhaps as a long-term fix if you're trying to get this in)? Are the module organizations too complicated/don't lend to reorganization/we're usually fine was just duplicating things?

I'm curious about this, too.

@THardy98 THardy98 force-pushed the scheduled_logging_capture_index_usage_stats branch from 697727f to cefdbd2 Compare February 25, 2022 18:28
Copy link
Author

@THardy98 THardy98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added tests :)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @jocrl, and @matthewtodd)


pkg/server/server_sql.go, line 1249 at r3 (raw file):

Previously, matthewtodd (Matthew Todd) wrote…

Maybe just call this Start? Since we're dealing with a global limit on the telemetry pipeline, I imagine any future scheduled logging would happen inside the same async task?

I've changed this to Start, and yes, I believe that the idea going forward. From what I've seen, Start also seems to be the convention when starting background processes.


pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 146 at r1 (raw file):

Previously, jocrl (Josephine) wrote…

Just curious, is this 10 log-line limit cumulative across other things that might also be logging in the telemetry pipeline?

Yes. Currently, the only other event being emitted to the telemetry pipeline is SampledQuery. Consequently, I reduced the frequency at which it could emit logs from 10 logs per second, to 8 per second, to allow CaptureIndexUsageStats to emit 2 logs per second.

IIUC, there could be some improvement on this in the future.


pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 199 at r3 (raw file):

Previously, matthewtodd (Matthew Todd) wrote…

Also, I notice that this function doesn't care about the EventPayload contents. Perhaps there's a natural seam here where we've got one thing selecting what we want to log (just index usage stats for now) and another thing that slowly writes them into the logging pipeline. I see now that decoupling them means you need some kind of buffering, which gets more complicated, so maybe that's more than we need to think about for now?

I'm not sure I fully understand this, but maybe I can provide some context with my own efforts to write a logging scheduler that supported any logger. Initially, I had written the scheduler to allow for multiple loggers that wanted to emit events on an interval. It was effectively the same as the current scheduler but it started a separate go-routine for each logger, so was a bit expensive. It didn't have any need for buffering as each logger was responsible for their own schedule and writing, but the number of go-routines likely wouldn't scale well over time. I was thinking of using some sort of a buffer that all logs could write to, and a background go-routine would periodically write the actual logs from the buffer. Is this in line with what you were thinking?


pkg/util/log/eventpb/telemetry.proto, line 53 at r1 (raw file):

Previously, matthewtodd (Matthew Todd) wrote…

I'm curious about this, too.

I don't think it would be very difficult. It would just need a new .proto file in a separate package (maybe idxusage in this case?). That being said, I'm not aware if moving roachpb.CollectedIndexUsageStatistics would cause for other issues (or maybe cause a different import cycle).

@THardy98 THardy98 force-pushed the scheduled_logging_capture_index_usage_stats branch 5 times, most recently from 6e219a4 to e4eddd5 Compare February 25, 2022 19:27
Copy link
Contributor

@jocrl jocrl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Approving for addressing my questions, and I'll leave the stamp to Matthew.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @matthewtodd, and @THardy98)


pkg/util/log/eventpb/telemetry.proto, line 53 at r1 (raw file):

Previously, THardy98 (Thomas Hardy) wrote…

I don't think it would be very difficult. It would just need a new .proto file in a separate package (maybe idxusage in this case?). That being said, I'm not aware if moving roachpb.CollectedIndexUsageStatistics would cause for other issues (or maybe cause a different import cycle).

I would suggest that if it can be done, and if they are conceptually the same thing, then pulling it out to avoid duplication would be helpful. I'm not familiar enough to assess how neutral/concerning this is, and so my uninformed sentiment is to leave it up to you whether to make the change in this pr or punt to the future.

@jocrl
Copy link
Contributor

jocrl commented Feb 25, 2022

Also, FYI on the failing TeamCity build 🙃

@THardy98 THardy98 force-pushed the scheduled_logging_capture_index_usage_stats branch 2 times, most recently from 1368c96 to e0e3ed9 Compare February 25, 2022 20:06
@THardy98 THardy98 marked this pull request as ready for review February 25, 2022 20:09
@THardy98 THardy98 force-pushed the scheduled_logging_capture_index_usage_stats branch from e0e3ed9 to 0ee384d Compare February 25, 2022 20:15
@THardy98 THardy98 force-pushed the scheduled_logging_capture_index_usage_stats branch 4 times, most recently from c3f5922 to d39a917 Compare February 28, 2022 14:14
@THardy98 THardy98 changed the title sql: created a scheduled logger to capture index usage stats sql: scheduled logger to capture index usage stats Feb 28, 2022
@THardy98 THardy98 force-pushed the scheduled_logging_capture_index_usage_stats branch 4 times, most recently from ea0dbcf to 84e1807 Compare February 28, 2022 15:49
Copy link
Contributor

@Azhng Azhng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 8 files at r10.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @jocrl, @maryliag, @matthewtodd, and @THardy98)


pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 82 at r11 (raw file):

// on a scheduled interval.
type CaptureIndexUsageStatsLoggingScheduler struct {
	DB                      *kv.DB

how come this is exported ?


pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 83 at r11 (raw file):

type CaptureIndexUsageStatsLoggingScheduler struct {
	DB                      *kv.DB
	cs                      *cluster.Settings

(optional) nit: i think most of the code base i've seen uses either st or settings as abbreivation rather than cs


pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 152 at r11 (raw file):

				err := s.DB.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
					return captureIndexUsageStats(ctx, s.ie, stopper, telemetryCaptureIndexUsageStatsLoggingDelay.Get(&s.cs.SV))

Hmm why are we starting a transaction here? The txn isn't passed to captureIndexUsageStats()


pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 156 at r11 (raw file):

				if err != nil {
					log.Errorf(ctx, "error capturing index usage stats: %+v", err)
				}

Hmm I feelerror is indicating something severely wrong with the cluster and demand immediate attention. Might be worth to just call this a Warning instead. And guard it behind log.V() so it doesn't pollute the log file too much.


pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 225 at r11 (raw file):

			}

			databaseName := tree.MustBeDString(row[0])

minor nit: hmm why not just directly use the databaseName that we already have? the '%s' as database_name feels a bit unnecessary.


pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 257 at r11 (raw file):

		if err != nil {
			return err
		}

small nit: inline the err assignment

if err = it.Close(); err != nil {
  return err
}

pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 43 at r11 (raw file):

func (s *stubDurations) setLoggingDuration(d time.Duration) {
	s.RWMutex.Lock()

nit: you can just do s.Lock() since the lock is embedded in the struct


pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 66 at r11 (raw file):

}

func installTelemetryLogFileSink(sc *log.TestLogScope, t *testing.T) func() {

nit: usually t *testing.T is the first args in the function


pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 78 at r11 (raw file):

	if err := cfg.Validate(&dir); err != nil {
		t.Fatal(err)
	}

small nit: use requirehere?


pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 162 at r11 (raw file):

	// of logs for each index.
	testutils.SucceedsWithin(t, func() error {
		return checkNumTotalEntriesAndNumIndexEntries(expectedTotalNumEntriesInSingleInterval, expectedNumberOfIndividualIndexEntriesInSingleInterval)

style nit: wrap the long line


pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 205 at r11 (raw file):

		0,
		math.MaxInt64,
		10000,

Hmm, what do these literals mean?


pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 219 at r11 (raw file):

	testData := []time.Duration{
		0 * time.Second,
		stubScheduleInterval,    /* second difference between first and second schedule */

style nit: we don't have too much inline comments with /**/ like this one. Move it above the stubScheduleInterval and swtich to use //


pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 220 at r11 (raw file):

		0 * time.Second,
		stubScheduleInterval,    /* second difference between first and second schedule */
		sd.getOverlapDuration(), /* second difference between second and third schedule */

to clarify, do you mean the difference in number of seconds ?

Copy link
Author

@THardy98 THardy98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @jocrl, @maryliag, @matthewtodd, and @THardy98)


pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 82 at r11 (raw file):

Previously, Azhng (Archer Zhang) wrote…

how come this is exported ?

There is no reason for it to be exported, I've un-exported it.


pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 83 at r11 (raw file):

Previously, Azhng (Archer Zhang) wrote…

(optional) nit: i think most of the code base i've seen uses either st or settings as abbreivation rather than cs

Changed to st.


pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 152 at r11 (raw file):

Previously, Azhng (Archer Zhang) wrote…

Hmm why are we starting a transaction here? The txn isn't passed to captureIndexUsageStats()

No need, removed.


pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 156 at r11 (raw file):

Previously, Azhng (Archer Zhang) wrote…

Hmm I feelerror is indicating something severely wrong with the cluster and demand immediate attention. Might be worth to just call this a Warning instead. And guard it behind log.V() so it doesn't pollute the log file too much.

I changed it to Warningf which I think should be fine. Only logs to the DEV channel.


pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 225 at r11 (raw file):

Previously, Azhng (Archer Zhang) wrote…

minor nit: hmm why not just directly use the databaseName that we already have? the '%s' as database_name feels a bit unnecessary.

Removed from the query, pass databaseName directly into the log payload.


pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 257 at r11 (raw file):

Previously, Azhng (Archer Zhang) wrote…

small nit: inline the err assignment

if err = it.Close(); err != nil {
  return err
}

Done.


pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 43 at r11 (raw file):

Previously, Azhng (Archer Zhang) wrote…

nit: you can just do s.Lock() since the lock is embedded in the struct

Done.


pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 66 at r11 (raw file):

Previously, Azhng (Archer Zhang) wrote…

nit: usually t *testing.T is the first args in the function

Done.


pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 78 at r11 (raw file):

Previously, Azhng (Archer Zhang) wrote…

small nit: use requirehere?

Done.


pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 162 at r11 (raw file):

Previously, Azhng (Archer Zhang) wrote…

style nit: wrap the long line

Done for each checkNumTotalEntriesAndNumIndexEntries call.


pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 205 at r11 (raw file):

Previously, Azhng (Archer Zhang) wrote…

Hmm, what do these literals mean?

I've assigned them to variables for naming.


pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 219 at r11 (raw file):

Previously, Azhng (Archer Zhang) wrote…

style nit: we don't have too much inline comments with /**/ like this one. Move it above the stubScheduleInterval and swtich to use //

Done.


pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 220 at r11 (raw file):

Previously, Azhng (Archer Zhang) wrote…

to clarify, do you mean the difference in number of seconds ?

Yes, clarified in the comment now.

@THardy98 THardy98 force-pushed the scheduled_logging_capture_index_usage_stats branch from 84e1807 to 2f6f68d Compare February 28, 2022 17:43
Copy link
Contributor

@Azhng Azhng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @Azhng, @jocrl, @maryliag, @matthewtodd, and @THardy98)


pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 82 at r11 (raw file):

Previously, THardy98 (Thomas Hardy) wrote…

There is no reason for it to be exported, I've un-exported it.

Ugh, I think it's still being exported 😅

Copy link
Contributor

@maryliag maryliag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 3 files at r3, 4 of 5 files at r8, 2 of 2 files at r9, 6 of 8 files at r10, 2 of 2 files at r12, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @maryliag, @matthewtodd, and @THardy98)

@THardy98 THardy98 force-pushed the scheduled_logging_capture_index_usage_stats branch 5 times, most recently from 2fa3944 to 1d8327f Compare February 28, 2022 19:55
This change introduces a scheduled logger that captures index usage
statistics logs on a time interval.

Release note (sql change): Initial implementation of a scheduled logger
used to capture index usage statistics to the telemetry logging channel.

Release justification: Category 4: Low risk, high benefit changes to
existing functionality.
@THardy98 THardy98 force-pushed the scheduled_logging_capture_index_usage_stats branch from 1d8327f to d09a45e Compare March 1, 2022 00:16
@THardy98
Copy link
Author

THardy98 commented Mar 1, 2022

bors r+

@craig
Copy link
Contributor

craig bot commented Mar 1, 2022

Build succeeded:

@THardy98
Copy link
Author

blathers backport 21.2

@blathers-crl
Copy link

blathers-crl bot commented Mar 25, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from d09a45e to blathers/backport-release-21.2-76886: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Capture index usage metrics in telemetry logging channel
6 participants