Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

backupccl: update RestoreDataProcessor to use ProcessorBase #53905

Merged
merged 1 commit into from
Sep 9, 2020

Conversation

pbardea
Copy link
Contributor

@pbardea pbardea commented Sep 3, 2020

Previously, RestoreDataProcessor would not properly signal to consumers
that it had encountered an error and was closing. This meant that it
would not drain its inputs. This could result in the restore DistSQL
flow becoming stuck, since the SplitAndScatter processor would be
blocked on sending a row to the RestoreDataProcessor which would already
be closed.

Fixes #53900.

Release justification: bug fix
Release note (bug fix): A failure while restoring data, may have
sometimes resulted in the restore job becoming stuck.

@pbardea pbardea requested review from yuzefovich, a team and miretskiy and removed request for a team September 3, 2020 15:51
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't look at the test, but the processor changes LGTM.

nit: for the release note: this bug has been present only on 20.2 branch, right? I'd then explicitly call it out.

Reviewed 2 of 3 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @pbardea, and @yuzefovich)


pkg/ccl/backupccl/full_cluster_backup_restore_test.go, line 446 at r1 (raw file):

	// Bugger the backup by removing the SST files. (Note this messes up all of
	// the backups, but there is only one at this point).

super nit: period should be before parenthesis :)


pkg/ccl/backupccl/restore_data_processor.go, line 43 at r1 (raw file):

	progCh  chan execinfrapb.RemoteProducerMetadata_BulkProcessorProgress
	lastErr error
	alloc   *rowenc.DatumAlloc

nit: I think usually we use rowenc.DatumAlloc (value, not pointer), then we don't need to explicitly initialize it.


pkg/ccl/backupccl/restore_data_processor.go, line 83 at r1 (raw file):

func (rd *restoreDataProcessor) Start(ctx context.Context) context.Context {
	ctx = rd.StartInternal(ctx, "restore-data")

Not sure how important this is, but I think we usually call input.Start(ctx) (and ignore it's returned possibly updated ctx) before calling StartInternal.


pkg/ccl/backupccl/restore_data_processor.go, line 84 at r1 (raw file):

func (rd *restoreDataProcessor) Start(ctx context.Context) context.Context {
	ctx = rd.StartInternal(ctx, "restore-data")
	ctx, span := tracing.ChildSpan(ctx, "restoreDataProcessor")

I don't think we need this anymore, right? StartInternal will create a span.


pkg/ccl/backupccl/restore_data_processor.go, line 92 at r1 (raw file):

	rd.input.Start(ctx)
	var err error
	rd.kr, err = storageccl.MakeKeyRewriterFromRekeys(rd.spec.Rekeys)

Maybe this should happen in the constructor above? Then we can propagate the error.


pkg/ccl/backupccl/restore_data_processor.go, line 102 at r1 (raw file):

}

// Run implements the execinfra.Processor interface.

nit: the comment needs an adjustment.


pkg/ccl/backupccl/restore_data_processor.go, line 147 at r1 (raw file):

		}

		log.VEventf(context.TODO(), 1 /* level */, "importing span %v", entry.Span)

You can use rd.Ctx here and below.

Copy link
Contributor Author

@pbardea pbardea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @yuzefovich)


pkg/ccl/backupccl/full_cluster_backup_restore_test.go, line 446 at r1 (raw file):

Previously, yuzefovich wrote…

super nit: period should be before parenthesis :)

Done.


pkg/ccl/backupccl/restore_data_processor.go, line 43 at r1 (raw file):

Previously, yuzefovich wrote…

nit: I think usually we use rowenc.DatumAlloc (value, not pointer), then we don't need to explicitly initialize it.

Done.


pkg/ccl/backupccl/restore_data_processor.go, line 83 at r1 (raw file):

Previously, yuzefovich wrote…

Not sure how important this is, but I think we usually call input.Start(ctx) (and ignore it's returned possibly updated ctx) before calling StartInternal.

Done.


pkg/ccl/backupccl/restore_data_processor.go, line 84 at r1 (raw file):

Previously, yuzefovich wrote…

I don't think we need this anymore, right? StartInternal will create a span.

Oh, nice. Done.


pkg/ccl/backupccl/restore_data_processor.go, line 92 at r1 (raw file):

Previously, yuzefovich wrote…

Maybe this should happen in the constructor above? Then we can propagate the error.

Done. Also moved the initialization of the progCh there since that's a better place for it.


pkg/ccl/backupccl/restore_data_processor.go, line 102 at r1 (raw file):

Previously, yuzefovich wrote…

nit: the comment needs an adjustment.

Done, and added comments to the RowSource methods.


pkg/ccl/backupccl/restore_data_processor.go, line 147 at r1 (raw file):

Previously, yuzefovich wrote…

You can use rd.Ctx here and below.

Ah, the wonders of processorBase. Thanks - done.

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @pbardea)


pkg/ccl/backupccl/restore_data_processor.go, line 76 at r2 (raw file):

	}

	// We don't have to worry about this go routine leaking because next we loop over progCh

nit: the comment seems to be incorrect now.

I think we need to close this channel in close() method.

Also, I don't think I see progCh being used now, am I missing something?

Copy link
Contributor Author

@pbardea pbardea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @yuzefovich)


pkg/ccl/backupccl/restore_data_processor.go, line 76 at r2 (raw file):

Previously, yuzefovich wrote…

nit: the comment seems to be incorrect now.

I think we need to close this channel in close() method.

Also, I don't think I see progCh being used now, am I missing something?

Nope, that's right. Needed to be cleaned up, removed entirely.

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: if Yevgeniy is happy with the tests.

Reviewed 1 of 1 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @miretskiy)

@pbardea pbardea force-pushed the restore-flow-stuck branch 2 times, most recently from bbf3398 to ed43545 Compare September 3, 2020 21:22
Previously, RestoreDataProcessor would not properly signal to consumers
that it had encountered an error and was closing. This meant that it
would not drain its inputs. This could result in the restore DistSQL
flow becoming stuck, since the SplitAndScatter processor would be
blocked on sending a row to the RestoreDataProcessor which would already
be closed.

Release justification: bug fix
Release note (bug fix): A failure while restoring data, may have
sometimes resulted in the restore job becoming stuck. This bug was only
present on 20.2 alphas and betas.
@pbardea
Copy link
Contributor Author

pbardea commented Sep 8, 2020

@miretskiy could I get a quick review for the tests here?

Also cc @dt for a riskiness-signoff since this is a larger change bug fix.

@pbardea pbardea requested a review from dt September 8, 2020 14:45
@pbardea
Copy link
Contributor Author

pbardea commented Sep 9, 2020

TFTRs!
r=yuzefovich,dt

@pbardea
Copy link
Contributor Author

pbardea commented Sep 9, 2020

bors r=yuzefovich,dt

@craig
Copy link
Contributor

craig bot commented Sep 9, 2020

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

restore: RESTORE flow hangs when RestoreDataProcessor quickly throws an error
4 participants