sql: scheduled logger to capture index usage stats #76886

THardy98 · 2022-02-22T14:02:55Z

Resolves: #72486

This change introduces a scheduled logger that captures index usage
statistics logs on a time interval.

Release note (sql change): Initial implementation of a scheduled logger
used to capture index usage statistics to the telemetry logging channel.

Release justification: Category 4: Low risk, high benefit changes to
existing functionality.

cockroach-teamcity · 2022-02-22T14:03:04Z

This change is

jocrl

I'm asking two questions but they're not blockers to approval, so don't worry about them if you're busy and trying to get things in 🙂.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng and @THardy98)

pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 146 at r1 (raw file):

// logIndexUsageStatsWithDelay logs a slice of eventpb.EventPayload at half
// second intervals (2 logs per second) to avoid exceeding the 10 log-line per
// second limit per node on the telemetry logging pipeline.

Just curious, is this 10 log-line limit cumulative across other things that might also be logging in the telemetry pipeline?

Code quote:

// second intervals (2 logs per second) to avoid exceeding the 10 log-line per
// second limit per node on the telemetry logging pipeline.

pkg/util/log/eventpb/telemetry.proto, line 53 at r1 (raw file):

  CommonEventDetails common = 1 [(gogoproto.nullable) = false, (gogoproto.jsontag) = "", (gogoproto.embed) = true];

  // Couldn't use roachpb.CollectedIndexUsageStatistics due to circular dependency.

Just out of curiosity, why is the solution here not to pull something out in a way that removes the circular dependency (perhaps as a long-term fix if you're trying to get this in)? Are the module organizations too complicated/don't lend to reorganization/we're usually fine was just duplicating things?

jocrl

Oh, and to add tests before un-drafting.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng and @THardy98)

matthewtodd

Reviewed 5 of 10 files at r1, 1 of 3 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @jocrl, and @THardy98)

pkg/server/server_sql.go, line 1249 at r3 (raw file):

	)

	scheduledlogging.StartCapturingIndexUsageStats(ctx, stopper, s.execCfg.DB, s.execCfg.Settings, s.internalExecutor)

Maybe just call this Start? Since we're dealing with a global limit on the telemetry pipeline, I imagine any future scheduled logging would happen inside the same async task?

pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 199 at r3 (raw file):

// second intervals (2 logs per second) to avoid exceeding the 10 log-line per
// second limit per node on the telemetry logging pipeline.
func logIndexUsageStatsWithDelay(

Also, I notice that this function doesn't care about the EventPayload contents. Perhaps there's a natural seam here where we've got one thing selecting what we want to log (just index usage stats for now) and another thing that slowly writes them into the logging pipeline. I see now that decoupling them means you need some kind of buffering, which gets more complicated, so maybe that's more than we need to think about for now?

pkg/util/log/eventpb/telemetry.proto, line 53 at r1 (raw file):

Previously, jocrl (Josephine) wrote…

Just out of curiosity, why is the solution here not to pull something out in a way that removes the circular dependency (perhaps as a long-term fix if you're trying to get this in)? Are the module organizations too complicated/don't lend to reorganization/we're usually fine was just duplicating things?

I'm curious about this, too.

THardy98

Added tests :)

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @jocrl, and @matthewtodd)

pkg/server/server_sql.go, line 1249 at r3 (raw file):

Previously, matthewtodd (Matthew Todd) wrote…

Maybe just call this Start? Since we're dealing with a global limit on the telemetry pipeline, I imagine any future scheduled logging would happen inside the same async task?

I've changed this to Start, and yes, I believe that the idea going forward. From what I've seen, Start also seems to be the convention when starting background processes.

pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 146 at r1 (raw file):

Previously, jocrl (Josephine) wrote…

Just curious, is this 10 log-line limit cumulative across other things that might also be logging in the telemetry pipeline?

Yes. Currently, the only other event being emitted to the telemetry pipeline is SampledQuery. Consequently, I reduced the frequency at which it could emit logs from 10 logs per second, to 8 per second, to allow CaptureIndexUsageStats to emit 2 logs per second.

IIUC, there could be some improvement on this in the future.

pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 199 at r3 (raw file):

Previously, matthewtodd (Matthew Todd) wrote…

Also, I notice that this function doesn't care about the EventPayload contents. Perhaps there's a natural seam here where we've got one thing selecting what we want to log (just index usage stats for now) and another thing that slowly writes them into the logging pipeline. I see now that decoupling them means you need some kind of buffering, which gets more complicated, so maybe that's more than we need to think about for now?

I'm not sure I fully understand this, but maybe I can provide some context with my own efforts to write a logging scheduler that supported any logger. Initially, I had written the scheduler to allow for multiple loggers that wanted to emit events on an interval. It was effectively the same as the current scheduler but it started a separate go-routine for each logger, so was a bit expensive. It didn't have any need for buffering as each logger was responsible for their own schedule and writing, but the number of go-routines likely wouldn't scale well over time. I was thinking of using some sort of a buffer that all logs could write to, and a background go-routine would periodically write the actual logs from the buffer. Is this in line with what you were thinking?

pkg/util/log/eventpb/telemetry.proto, line 53 at r1 (raw file):

Previously, matthewtodd (Matthew Todd) wrote…

I'm curious about this, too.

I don't think it would be very difficult. It would just need a new .proto file in a separate package (maybe idxusage in this case?). That being said, I'm not aware if moving roachpb.CollectedIndexUsageStatistics would cause for other issues (or maybe cause a different import cycle).

jocrl

Thank you! Approving for addressing my questions, and I'll leave the stamp to Matthew.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @matthewtodd, and @THardy98)

pkg/util/log/eventpb/telemetry.proto, line 53 at r1 (raw file):

Previously, THardy98 (Thomas Hardy) wrote…

I don't think it would be very difficult. It would just need a new .proto file in a separate package (maybe idxusage in this case?). That being said, I'm not aware if moving roachpb.CollectedIndexUsageStatistics would cause for other issues (or maybe cause a different import cycle).

I would suggest that if it can be done, and if they are conceptually the same thing, then pulling it out to avoid duplication would be helpful. I'm not familiar enough to assess how neutral/concerning this is, and so my uninformed sentiment is to leave it up to you whether to make the change in this pr or punt to the future.

jocrl · 2022-02-25T19:46:58Z

Also, FYI on the failing TeamCity build 🙃

Azhng

Reviewed 1 of 8 files at r10.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @jocrl, @maryliag, @matthewtodd, and @THardy98)

pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 82 at r11 (raw file):

// on a scheduled interval.
type CaptureIndexUsageStatsLoggingScheduler struct {
	DB                      *kv.DB

how come this is exported ?

pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 83 at r11 (raw file):

type CaptureIndexUsageStatsLoggingScheduler struct {
	DB                      *kv.DB
	cs                      *cluster.Settings

(optional) nit: i think most of the code base i've seen uses either st or settings as abbreivation rather than cs

pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 152 at r11 (raw file):

				err := s.DB.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
					return captureIndexUsageStats(ctx, s.ie, stopper, telemetryCaptureIndexUsageStatsLoggingDelay.Get(&s.cs.SV))

Hmm why are we starting a transaction here? The txn isn't passed to captureIndexUsageStats()

pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 156 at r11 (raw file):

				if err != nil {
					log.Errorf(ctx, "error capturing index usage stats: %+v", err)
				}

Hmm I feelerror is indicating something severely wrong with the cluster and demand immediate attention. Might be worth to just call this a Warning instead. And guard it behind log.V() so it doesn't pollute the log file too much.

pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 225 at r11 (raw file):

			}

			databaseName := tree.MustBeDString(row[0])

minor nit: hmm why not just directly use the databaseName that we already have? the '%s' as database_name feels a bit unnecessary.

pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 257 at r11 (raw file):

		if err != nil {
			return err
		}

small nit: inline the err assignment

if err = it.Close(); err != nil {
  return err
}

pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 43 at r11 (raw file):

func (s *stubDurations) setLoggingDuration(d time.Duration) {
	s.RWMutex.Lock()

nit: you can just do s.Lock() since the lock is embedded in the struct

pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 66 at r11 (raw file):

}

func installTelemetryLogFileSink(sc *log.TestLogScope, t *testing.T) func() {

nit: usually t *testing.T is the first args in the function

pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 78 at r11 (raw file):

	if err := cfg.Validate(&dir); err != nil {
		t.Fatal(err)
	}

small nit: use requirehere?

pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 162 at r11 (raw file):

	// of logs for each index.
	testutils.SucceedsWithin(t, func() error {
		return checkNumTotalEntriesAndNumIndexEntries(expectedTotalNumEntriesInSingleInterval, expectedNumberOfIndividualIndexEntriesInSingleInterval)

style nit: wrap the long line

pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 205 at r11 (raw file):

		0,
		math.MaxInt64,
		10000,

Hmm, what do these literals mean?

pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 219 at r11 (raw file):

	testData := []time.Duration{
		0 * time.Second,
		stubScheduleInterval,    /* second difference between first and second schedule */

style nit: we don't have too much inline comments with /**/ like this one. Move it above the stubScheduleInterval and swtich to use //

pkg/sql/scheduledlogging/captured_index_usage_stats_test.go, line 220 at r11 (raw file):

		0 * time.Second,
		stubScheduleInterval,    /* second difference between first and second schedule */
		sd.getOverlapDuration(), /* second difference between second and third schedule */

to clarify, do you mean the difference in number of seconds ?

THardy98

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @jocrl, @maryliag, @matthewtodd, and @THardy98)

pkg/sql/scheduledlogging/captured_index_usage_stats.go, line 82 at r11 (raw file):