ttl: ensure each DistSQL processor only deletes its own data #86057

Merged: 1 commit into cockroachdb:master on Aug 29, 2022

Conversation

@ecwall ecwall commented Aug 12, 2022

refs #85800

Fix span/range calculation to filter out rows owned by other nodes.

Rows are now grouped by node via DistSQLPlanner.PartitionSpans instead
of RangeIterator, so rows that live in the same range but belong to a
different table or index no longer need to be filtered out.

Release note: None

Release justification: TTL performance bug fix.

@ecwall ecwall requested review from rafiss, otan and a team August 12, 2022 20:30
@cockroach-teamcity (Member) commented

This change is Reviewable

@otan otan removed their request for review August 15, 2022 00:52
@rafiss (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ecwall and @rafiss)


pkg/sql/ttl/ttljob/ttljob_processor.go line 220 at r1 (raw file):

			progress := md.Progress
			existingRowCount := progress.Details.(*jobspb.Progress_RowLevelTTL).RowLevelTTL.RowCount
			progress.Details.(*jobspb.Progress_RowLevelTTL).RowLevelTTL.RowCount += processorRowCount

should this use atomic.LoadInt64 first to access the processorRowCount?


pkg/sql/ttl/ttljob/ttljob_processor.go line 222 at r1 (raw file):

			progress.Details.(*jobspb.Progress_RowLevelTTL).RowLevelTTL.RowCount += processorRowCount
			ju.UpdateProgress(progress)
			log.Infof(

how frequently will this log? it feels like it could be something that should only be shown with log.Vinfof(ctx, 2, ...)


pkg/sql/ttl/ttljob/ttljob_processor.go line 233 at r1 (raw file):

// rangeRowCount should be checked even if the function returns an error because it may have partially succeeded
func (ttl *ttlProcessor) runTTLOnRange(

nit: function receivers are usually one character by convention. otherwise this is easily confused with the ttl package. (and use the same for all the functions that have this receiver.) i'd say use either t or p.
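For illustration, a minimal sketch of the suggested rename (hypothetical, abbreviated signature; the real method is in ttljob_processor.go):

    // Before: the receiver name shadows the ttl package.
    //   func (ttl *ttlProcessor) runTTLOnRange(ctx context.Context) error

    // After: a one-character receiver, used consistently on every method.
    func (t *ttlProcessor) runTTLOnRange(ctx context.Context) error {
        // Body unchanged; references like ttl.ProcessorID become t.ProcessorID.
        return nil
    }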


pkg/sql/ttl/ttljob/ttljob_processor.go line 302 at r1 (raw file):

Previously, ecwall (Evan Wall) wrote…

It looks like the nodes can select outside of the range they are leaseholders for so I need to fix that, but I'll do it separately.

this should be a TODO in the code, not just a comment on the PR
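For example, a hypothetical form the in-code TODO could take (wording assumed, not the actual committed comment):

    // TODO(ecwall): Nodes can currently select outside the ranges they are
    // leaseholders for; fix the span bounds separately.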


pkg/sql/ttl/ttljob/ttljob_processor.go line 303 at r1 (raw file):

		expiredRowsPKs, err := selectBuilder.run(ctx, ie)
		log.Dev.Infof(
			ctx,

nit: use log.Vinfof(ctx, 2, ...)


pkg/sql/ttl/ttljob/ttljob_processor.go line 305 at r1 (raw file):

			ctx,
			"TTL selected rows jobID=%d processorID=%d tableID=%d expiredRowsPKs=%s",
			ttlSpec.JobID, ttl.ProcessorID, details.TableID, expiredRowsPKs,

how big is expiredRowsPKs? the log won't be useful if this prints out hundreds/thousands of entries.

@ecwall (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)


pkg/sql/ttl/ttljob/ttljob_processor.go line 220 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

should this use atomic.LoadInt64 first to access the processorRowCount?

processorRowCount isn't being modified anymore by this point so there isn't a race between updates and reads.
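A self-contained sketch (stdlib only; hypothetical names, not the actual ttljob code) of why the plain read is safe: every atomic.AddInt64 happens in a worker goroutine, and the synchronization point (here wg.Wait) establishes a happens-before edge, so the later read needs no atomic.LoadInt64.

    package main

    import (
        "fmt"
        "sync"
        "sync/atomic"
    )

    func main() {
        var processorRowCount int64
        var wg sync.WaitGroup
        for i := 0; i < 4; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                // Concurrent updates while work is in flight.
                atomic.AddInt64(&processorRowCount, 100)
            }()
        }
        wg.Wait() // every update happens-before this returns

        // Plain read is race-free: no goroutine can still modify the counter.
        fmt.Println("rows deleted:", processorRowCount)
    }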


pkg/sql/ttl/ttljob/ttljob_processor.go line 222 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

how frequently will this log? it feels like it could be something that should only be shown with log.Vinfof(ctx, 2, ...)

It will log once per leaseholder per TTL run. I can change it to that though.
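A minimal sketch of the suggested change, assuming the log.Vinfof helper the reviewer references (the surrounding variables are illustrative):

    // Before: emitted at the default INFO level once per leaseholder per run.
    log.Infof(ctx, "TTL deleted rows jobID=%d rowCount=%d", jobID, rowCount)

    // After: only emitted when verbosity is raised to at least 2.
    log.Vinfof(ctx, 2, "TTL deleted rows jobID=%d rowCount=%d", jobID, rowCount)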


pkg/sql/ttl/ttljob/ttljob_processor.go line 233 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

nit: function receivers are usually one character by convention. otherwise this is easily confused with the ttl package. (and use the same for all the functions that have this receiver.) i'd say use either t or p.

Changed to t.


pkg/sql/ttl/ttljob/ttljob_processor.go line 302 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

this should be a TODO in the code, not just a comment on the PR

Addressed in a prior PR.


pkg/sql/ttl/ttljob/ttljob_processor.go line 303 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

nit: use log.Vinfof(ctx, 2, ...)

Addressed in a prior PR.


pkg/sql/ttl/ttljob/ttljob_processor.go line 305 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

how big is expiredRowsPKs? the log won't be useful if this prints out hundreds/thousands of entries.

Removed log.

@ecwall (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)


pkg/sql/ttl/ttljob/ttljob_processor.go line 157 at r4 (raw file):

		ri := kvcoord.MakeRangeIterator(serverCfg.DistSender)
		for _, span := range ttlSpec.Spans {
			spanStartKey := span.Key

This logic works correctly in the multi-node test, but the single-node tests are failing. I don't quite understand the relationship between a span and a range and need help with this logic.

@ecwall changed the title from "ttl: fix flaky multi node test" to "ttl: ensure each DistSQL processor only deletes its own data" on Aug 22, 2022
@ecwall ecwall requested a review from rafiss August 25, 2022 02:10
@yuzefovich (Member) commented

I think we need to backport the key decoding fixes to 22.1 if the TTL feature is enabled for tenants in serverless.

@rafiss (Collaborator) left a comment

nice, i just had questions since i don't fully understand the fix

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ecwall and @rafiss)


pkg/sql/ttl/ttljob/ttljob_processor.go line 153 at r6 (raw file):

		// Iterate over every range to feed work for the goroutine processors.
		var alloc tree.DatumAlloc
		for _, span := range ttlSpec.Spans {

interesting fix -- why don't we need the rangeIterator any more? could you explain in the commit message?


pkg/sql/ttl/ttljob/ttljob_processor.go line 374 at r6 (raw file):

	rKey := key.AsRawKey()

	// If any of these errors, that means we reached an "empty" key, which

i didn't follow why these two checks should be removed. could you explain in the commit message?


pkg/sql/ttl/ttljob/ttljob_test.go line 198 at r6 (raw file):

}

// todo(ewall): migrate usages to verifyExpiredRows and switch SPLIT AT usage to SplitTable

i thought we already did the SplitTable change?

@ecwall (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @otan and @rafiss)


pkg/sql/ttl/ttljob/ttljob_processor.go line 153 at r6 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

interesting fix -- why don't we need the rangeIterator any more? could you explain in the commit message?

I think the range iterator was there to group the batches by range to make them more efficient, but ttlSpec.Spans (sourced from DistSQLPlanner.PartitionSpans) already contains only the spans relevant to this node.

I think in the future we can remove the parallel Span processing and increase the batch size to reduce the number of config params, but I need to test this in the workload.
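A runnable sketch of that idea with stand-in types (hypothetical; not the actual ttljob code): PartitionSpans has already handed each processor only its own spans, so the processor can walk them directly instead of re-expanding each span into ranges with a RangeIterator and filtering.

    package main

    import "fmt"

    // Span stands in for roachpb.Span.
    type Span struct{ Key, EndKey string }

    // processSpans mirrors the new shape of the loop: every span was already
    // partitioned to this node and never straddles another table or index,
    // so no per-range filtering is needed before enqueueing work.
    func processSpans(spans []Span, enqueue func(Span)) {
        for _, sp := range spans {
            enqueue(sp)
        }
    }

    func main() {
        spans := []Span{
            {Key: "/Table/104/1/0", EndKey: "/Table/104/1/100"},
            {Key: "/Table/104/1/200", EndKey: "/Table/104/1/300"},
        }
        processSpans(spans, func(sp Span) { fmt.Printf("work: %v\n", sp) })
    }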


pkg/sql/ttl/ttljob/ttljob_processor.go line 374 at r6 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

i didn't follow why these two checks should be removed. could you explain in the commit message?

I don't think these branches were being hit anymore. @otan said tests were previously failing without them, but now the tests pass without them: https://cockroachlabs.slack.com/archives/C01RX2G8LT1/p1661281166206439

I'll update the message.


pkg/sql/ttl/ttljob/ttljob_test.go line 198 at r6 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

i thought we already did the SplitTable change?

I did for the multi-node tests, but other tests were failing because of some Voter error so I need to look into it more.

@ecwall ecwall requested a review from rafiss August 26, 2022 18:49

@rafiss (Collaborator) left a comment

nice, thanks for the explanations

Reviewed 3 of 4 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @otan and @rafiss)

@ecwall (Contributor, Author) commented Aug 29, 2022

bors r=rafiss

@craig bot commented Aug 29, 2022

Build succeeded:

@craig craig bot merged commit 21365bf into cockroachdb:master Aug 29, 2022