
spanconfig: introduce the span config manager #68522

Merged
merged 2 commits into cockroachdb:master from spanconfig-manager
Aug 18, 2021

Conversation

arulajmani
Collaborator

See individual commits for details.

@arulajmani arulajmani requested a review from a team as a code owner August 5, 2021 23:54
@arulajmani arulajmani requested a review from a team August 5, 2021 23:54
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@irfansharif irfansharif left a comment

LGTM, but I'll defer to others for the actual rubber stamp. My comments are mostly around adding more comments. I think it'd be very helpful to explain in text somewhere what various pieces we're looking to introduce over subsequent PRs and how we expect them to get stitched together. It'd make it easier+faster to review.

To that end, I'd make the commit messages a bit more verbose -- I think it's too little context unless you have it all paged in already. For example, explain that this job runs per-tenant, and what it'll eventually be slated to do. Also, refer to #67679. We could alternatively create an issue containing all the details + pointers to relevant bits in the prototype branch so we know the final state we're headed towards and how this intermediate PR gets us there.

Reviewed 5 of 5 files at r1, 23 of 23 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, and @nvanbenschoten)


pkg/jobs/registry_test.go, line 128 at r2 (raw file):

		Knobs: base.TestingKnobs{
			SpanConfig: &spanconfig.TestingKnobs{
				ManagerDisableJobCreation: true,

Why are we disabling the span config manager in this test? Maybe add an inline comment.


pkg/server/admin_test.go, line 1437 at r2 (raw file):

		Knobs: base.TestingKnobs{
			SpanConfig: &spanconfig.TestingKnobs{
				ManagerDisableJobCreation: true,

Ditto. Out of curiosity -- is there anything else that's non-invasive-ish that we can do to generalize these tests to not need to use this knob?


pkg/server/server_sql.go, line 804 at r2 (raw file):

	{
		knobs, _ := cfg.TestingKnobs.SpanConfig.(*spanconfig.TestingKnobs)
		reconciliationMgr := spanconfigmanager.New(

Add some commentary here explaining how+why these things are stitched together.


pkg/spanconfig/spanconfig.go, line 17 at r2 (raw file):

// configs with the cluster's.
type ReconciliationDependencies interface {
	// KVAccessor

If you're making mention of future interfaces here, add some additional commentary. This is not providing the most helpful context I think.


pkg/spanconfig/testing_knobs.go, line 18 at r2 (raw file):

// for testing.
type TestingKnobs struct {
	ManagerDisableJobCreation bool

Add a comment, mention it's the auto reconciliation job that's not being created.


pkg/spanconfig/testing_knobs.go, line 24 at r2 (raw file):

}

// ModuleTestingKnobs makes TestingKnobs a base.ModuleTestingKnobs.

// ModuleTestingKnobs is part of the base.ModuleTestingKnobs interface.


pkg/spanconfig/spanconfigmanager/manager.go, line 28 at r2 (raw file):

)

// Manager is responsible for ensuring that only one (and only one) span config

"that one (and only one)"


pkg/spanconfig/spanconfigmanager/manager.go, line 41 at r2 (raw file):

var _ spanconfig.ReconciliationDependencies = &Manager{}

// New constructs a new reconciliation manager.

Drop the "reconciliation manager" language, I think this was a remnant from our earlier package structure with the zcfgreconciler guy.


pkg/spanconfig/spanconfigmanager/manager.go, line 61 at r2 (raw file):

}

// StartJobIfNoneExists will create and start the span config reconciliation job

s/Exists/Exist?


pkg/spanconfig/spanconfigmanager/manager_test.go, line 71 at r2 (raw file):

	manager.StartJobIfNoneExists(ctx)

	testutils.SucceedsSoon(t, func() error {

Is there a way to cleanly test that the concurrent attempts don't do anything? Perhaps making sure there are no outstanding stopper tasks. I think with the test as is it's possible to flake your way into success if it completes before the other StartJobIfNoneExists calls get to do anything.


pkg/sql/exec_util.go, line 1022 at r2 (raw file):

	// StartSpanConfigReconciliationJobHook is called during the SQL server
	// startup process, to idempotently create and start a span config

s/,/
s/reconciliation for the tenant if none exists/reconciliation job


pkg/sql/exec_util.go, line 1027 at r2 (raw file):

}

// StartSpanConfigReconciliationJobHook is used to create and start the span

// StartSpanConfigReconciliationJobHook is used to idempotently create and start the span config reconciliation job.


pkg/sql/delegate/show_jobs.go, line 66 at r1 (raw file):

	sqlStmt := fmt.Sprintf("%s %s %s", selectClause, whereClause, orderbyClause)
	if n.Block {
		// The AutoSpanConfigReconciliation job is never meant to complete so we

s/never meant/not meant
s/via the WHERE//


pkg/sql/delegate/show_jobs.go, line 67 at r1 (raw file):

	if n.Block {
		// The AutoSpanConfigReconciliation job is never meant to complete so we
		// explicitly exclude it here via the WHERE clause. This is required because

I'm surprised we need this. What do you mean by not having control over what filter is being applied first? We're filtering out the automatic jobs in the subquery -- aren't we guaranteed to never see them here? Unless the user's trying SHOW AUTOMATIC JOBS WHEN COMPLETE, but even then -- do we want to be smart about it here and filter out an automatic job?

Contributor

@ajwerner ajwerner left a comment

very fast first pass

Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


pkg/spanconfig/spanconfigmanager/manager.go, line 29 at r2 (raw file):

// Manager is responsible for ensuring that only one (and only one) span config
// reconciliation job is ever created. It also provides the job with all

I don't think "is ever created" is the right language. I would think instead it should be something like "is ever running".


pkg/spanconfig/spanconfigmanager/manager.go, line 83 at r2 (raw file):

	}

	var startableJob *jobs.StartableJob

I don't think StartableJob is what you're looking for. It's a bit of a hack to get some output into the caller's goroutine. I think you just want a regular ol' job.

Member

@nvanbenschoten nvanbenschoten left a comment

Things mostly look good. I found myself agreeing with all of Irfan's comments and left a few of my own. It's exciting to see this!

Reviewed 5 of 5 files at r1, 23 of 23 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @irfansharif)


pkg/spanconfig/spanconfigmanager/manager.go, line 64 at r2 (raw file):

// iff it hadn't been created already.
func (m *Manager) StartJobIfNoneExists(ctx context.Context) {
	_ = m.stopper.RunAsyncTask(ctx, "create-and-start-reconciliation-job", func(ctx context.Context) {

Out of curiosity, is there a benefit to pulling all of createAndStartJobIfNoneExist into an async task instead of just the job in cases where it doesn't yet exist? I don't think there's a performance concern here, and it's usually easier to use an API that is synchronous for as long as possible. Irfan's comment on the test demonstrates this.

EDIT: beyond what I said here, if you pushed the async part back, you could also return a bool indicating whether the job started or not. This would make testing a breeze compared to what you're doing now.
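For illustration, a minimal sketch of the synchronous shape being suggested here -- not the PR's actual implementation. It reuses m.db and m.checkIfReconciliationJobExists from the code under review; createJob is a hypothetical stand-in for however the job record actually gets written:

func (m *Manager) createAndStartJobIfNoneExists(ctx context.Context) (bool, error) {
	started := false
	if err := m.db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
		started = false // reset in case the txn retries
		exists, err := m.checkIfReconciliationJobExists(ctx, txn)
		if err != nil {
			return err
		}
		if exists {
			// Another node (or an earlier attempt) already created the job.
			return nil
		}
		if err := m.createJob(ctx, txn); err != nil { // hypothetical helper
			return err
		}
		started = true
		return nil
	}); err != nil {
		return false, err
	}
	return started, nil
}

The caller is then free to wrap this in an async task, and tests can assert directly on the returned bool.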


pkg/spanconfig/spanconfigmanager/manager.go, line 122 at r2 (raw file):

	ctx context.Context, txn *kv.Txn,
) (exists bool, _ error) {
	const stmt = `

This query is transactional all the way down to the scan of system.jobs, right? I believe so based on a reading of crdbInternalJobsTable, but we should confirm, because, without that, the transaction in createAndStartJobIfNoneExist isn't enough to provide mutual exclusion and guarantee only one job running at a time.

Is this worth a test? For instance, you could spin up two calls to createAndStartJobIfNoneExist, let both get past this point and block in a testing hook, then let them proceed and assert that only one succeeds and the other txn restarts and notices the job. I guess what I'm asking for is a more deterministic version of TestManagerJobCreation that doesn't rely on a lucky race to reveal bugs.


pkg/spanconfig/spanconfigmanager/manager.go, line 126 at r2 (raw file):

         SELECT job_id
           FROM [SHOW AUTOMATIC JOBS]
          WHERE job_type = 'AUTO SPAN CONFIG RECONCILIATION'

Do we want to hard code this 'AUTO SPAN CONFIG RECONCILIATION' string? Should it come from a const defined somewhere?
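As a sketch of that suggestion (assuming the query keeps being assembled with fmt.Sprintf), the string could come from the jobspb type rather than a literal:

stmt := fmt.Sprintf(`
SELECT job_id
  FROM [SHOW AUTOMATIC JOBS]
 WHERE job_type = '%s'
`, jobspb.TypeAutoSpanConfigReconciliation.String())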


pkg/sql/planner.go, line 542 at r2 (raw file):

}

// SpanConfigReconciliationJobDeps returns the spanconfig.ReconciliationJobDeps

nit: missing period.


pkg/sql/delegate/show_jobs.go, line 43 at r1 (raw file):

		// Display all [only automatic] jobs without selecting specific jobs.
		if n.Automatic {
			typePredicate = fmt.Sprintf("job_type = '%s' OR job_type = '%s'",

Is now the time to abstract this by pulling out some kind of jobspb.AutomaticJobs static array?

Even if not, use IN here and NOT IN below to make this a little harder to get wrong.
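A rough sketch of the IN-based predicate, assuming it gets built from a shared jobspb.AutomaticJobTypes list (quoting is simplified for illustration):

var buf strings.Builder
buf.WriteString("job_type IN (")
for i, typ := range jobspb.AutomaticJobTypes {
	if i > 0 {
		buf.WriteString(", ")
	}
	fmt.Fprintf(&buf, "'%s'", typ)
}
buf.WriteByte(')')
typePredicate := buf.String()
// The non-automatic branch builds "job_type NOT IN (...)" the same way.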


pkg/sql/delegate/show_jobs.go, line 67 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

I'm surprised we need this. What do you mean by not having control over what filter is being applied first? We're filtering out the automatic jobs in the subquery -- aren't we guaranteed to never see them here? Unless the user's trying SHOW AUTOMATIC JOBS WHEN COMPLETE, but even then -- do we want to be smart about it here and filter out an automatic job?

I had the same question. Isn't the filter above sufficient?


pkg/sql/logictest/testdata/logic_test/auto_span_config_reconciliation_job, line 20 at r2 (raw file):


# Ensure that the job is not cancelable.
statement error not cancelable

Is there a pgcode attached to this?

@blathers-crl blathers-crl bot requested a review from irfansharif August 12, 2021 00:57
Collaborator Author

@arulajmani arulajmani left a comment

RFAL!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @irfansharif, and @nvanbenschoten)


pkg/jobs/registry_test.go, line 128 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Why are we disabling the span config manager in this test? Maybe add an inline comment.

Done.


pkg/server/admin_test.go, line 1437 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Ditto. Out of curiosity -- is there anything else that's non-invasive-ish that we can do to generalize these tests to not need to use this knob?

Done.


pkg/server/server_sql.go, line 804 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Add some commentary here explaining how+why these things are stitched together.

Done.


pkg/spanconfig/spanconfig.go, line 17 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

If you're making mention of future interfaces here, add some additional commentary. This is not providing the most helpful context I think.

Done.


pkg/spanconfig/testing_knobs.go, line 18 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Add a comment, mention it's the auto reconciliation job that's not being created.

Done.


pkg/spanconfig/testing_knobs.go, line 24 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

// ModuleTestingKnobs is part of the base.ModuleTestingKnobs interface.

Done.


pkg/spanconfig/spanconfigmanager/manager.go, line 28 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

"that one (and only one)"

Done.


pkg/spanconfig/spanconfigmanager/manager.go, line 29 at r2 (raw file):

Previously, ajwerner wrote…

I don't think is ever created is the right language. I would think instead it should be something like, is ever running.

Done.


pkg/spanconfig/spanconfigmanager/manager.go, line 41 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Drop the "reconciliation manager" language, I think this was a remnant from our earlier package structure with the zcfgreconciler guy.

Done.


pkg/spanconfig/spanconfigmanager/manager.go, line 61 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

s/Exists/Exist?

"job" is singular, so it should be exists right?


pkg/spanconfig/spanconfigmanager/manager.go, line 64 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Out of curiosity, is there a benefit to pulling all of createAndStartJobIfNoneExist into an async task instead of just the job in cases where it doesn't yet exist? I don't think there's a performance concern here, and it's usually easier to use an API that is synchronous for as long as possible. Irfan's comment on the test demonstrates this.

EDIT: beyond what I said here, if you pushed the async part back, you could also return a bool indicating whether the job started or not. This would make testing a breeze compared to what you're doing now.

I changed a few things around here, most notably I switched from a StartableJob to a regular job based on Andrew's comment. There isn't much to do after we've written to system.jobs for regular jobs. I'm notifying the registry to pick this new job up but from what I understand this is merely a nudge and not required.

This NotifyToAdoptJobs isn't blocking right now but it could be in the future as it sends things on an unbuffered channel. As such, I have this part in an async task for now. Is this overkill? Should I just make the whole thing synchronous?


pkg/spanconfig/spanconfigmanager/manager.go, line 122 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This query is transactional all the way down to the scan of system.jobs, right? I belive so based on a reading of crdbInternalJobsTable, but we should confirm, because, without that, the transaction in createAndStartJobIfNoneExist isn't enough to provide mutual exclusion and guarantee only one job running at a time.

Is this worth a test? For instance, you could spin up two calls to createAndStartJobIfNoneExist, let both get past this point and block in a testing hook, then let them proceed and assert that only one succeeds and the other txn restarts and notices the job. I guess what I'm asking for is a more deterministic version of TestManagerJobCreation that doesn't rely on a lucky race to reveal bugs.

I changed TestManagerJobCreation to look like what you described above. Let me know what you think


pkg/spanconfig/spanconfigmanager/manager_test.go, line 71 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Is there a way to cleanly test that the concurrent attempts don't do anything? Perhaps making sure there are no outstanding stopper tasks. I think with the test as is it's possible to flake your way into success if it completes before the other StartJobIfNoneExists calls get to do anything.

Changed this test around to be a bit more deterministic, lemme know what you think.


pkg/sql/exec_util.go, line 1022 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

s/,/
s/reconciliation for the tenant if none exists/reconciliation job

Done.


pkg/sql/exec_util.go, line 1027 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

// StartSpanConfigReconciliationJobHook is used to idempotently create and start the span config reconciliation job.

Done.


pkg/sql/delegate/show_jobs.go, line 43 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is now the time to abstract this by pulling out some kind of jobspb.AutomaticJobs static array?

Even if not, use IN here and NOT IN below to make this a little harder to get wrong.

Done


pkg/sql/delegate/show_jobs.go, line 66 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

s/never meant/not meant
s/via the WHERE//

No longer applicable.


pkg/sql/delegate/show_jobs.go, line 67 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I had the same question. Isn't the filter above sufficient?

tldr: this is no longer a problem after @ajwerner's changes.

This codepath corresponds to a SHOW JOBS WHEN COMPLETE <select_stmt> where the select statement returns a list of job IDs to wait for. This select statement is used to inform the whereClause above. The way this query was written before, the optimizer could decide to reorder the filters, producing a plan like:

 • hash join (semi)
  │ equality: (job_id) = (?column?)
  │ right cols are key
  │
  ├── • filter
  │   │ filter: CASE finished IS NULL WHEN true THEN CASE pg_sleep(1.0) WHEN true THEN crdb_internal.force_retry('24:00:00') ELSE 0 END ELSE 0 END = 0
  │   │
  │   └── • virtual table
  │         table: jobs@primary
  │
  └── • union
      │ estimated row count: 2
      │
      ├── • values
      │     size: 1 column, 1 row
      │
      └── • values
            size: 1 column, 1 row

So this query could end up getting stuck in a constant retry loop. With @ajwerner's changes that rework this to use a CTE this can no longer happen.


pkg/sql/logictest/testdata/logic_test/auto_span_config_reconciliation_job, line 20 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is there a pgcode attached to this?

The jobs subsystem doesn't return one when trying to cancel a non-cancellable job.

@arulajmani arulajmani requested a review from RichardJCai August 12, 2021 00:58
@arulajmani arulajmani force-pushed the spanconfig-manager branch 2 times, most recently from 9fe366a to 0db6ece Compare August 12, 2021 13:53
Contributor

@ajwerner ajwerner left a comment

I think we need to get some people in a room and make decisions on how to deal with this job stuff.


// OnFailOrCancel implements the jobs.Resumer interface.
func (r *resumer) OnFailOrCancel(context.Context, interface{}) error {
return errors.AssertionFailedf("span config reconciliation job can never fail or be canceled")
Contributor

💵 says we get this sentry report before the end of 2022.

Contributor

actually, I don't know how these things end up getting reported. Maybe no report because the jobs system doesn't send them.


var job *jobs.Job
if err := m.db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
exists, err := m.checkIfReconciliationJobExists(ctx, txn)
Contributor

@Azhng has a thing for this coming in #68434, see RunningJobExists.

@@ -996,6 +1015,9 @@ func (s *SQLServer) preStart(
return err
}

// Create and start the span config reconciliation job if none exist.
_ = s.execCfg.StartSpanConfigReconciliationJobHook(ctx)
Contributor

this is a blocking call, I don't know that it belongs in preStart here.

Comment on lines 62 to 58
// StartJobIfNoneExists will create and start the span config reconciliation job
// iff it hadn't been created already. Returns a boolean indicating if the job
// was started.
func (m *Manager) StartJobIfNoneExists(ctx context.Context) bool {
Contributor

I'm not sure that this synchronous method is really that helpful. What if we just had this thing sit in a loop and periodically, like every hour or something (after startup) go and run the check. If you need a synchronous hook, you can have it poke the loop or something. My guess is that the synchronous thing is handy in testing. Something smells about it being called in preStart.

// TODO(zcfg-pod): Upcoming PRs will actually make use of these reconciliation
// dependencies.
_ = rc

Contributor

Is the idea that the implementation of reconciliation will happen here, or is the idea that there will be another injected piece of machinery that gets invoked?

@@ -996,6 +1015,9 @@ func (s *SQLServer) preStart(
return err
}

// Create and start the span config reconciliation job if none exist.
_ = s.execCfg.StartSpanConfigReconciliationJobHook(ctx)
Contributor

Also, doesn't this need version gating? We can't run this until after we've activated a version where all of the nodes know about this job type.

Member

@nvanbenschoten nvanbenschoten left a comment

Reviewed 2 of 26 files at r3, 24 of 25 files at r4, 3 of 3 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, and @RichardJCai)


pkg/jobs/registry_test.go, line 128 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Done.

What do you mean by "noise" here? Does the test fail if we don't disable the job creation?


pkg/jobs/jobspb/wrap.go, line 69 at r5 (raw file):

// AutomaticJobTypes is a list of automatic job types that currently exist.
var AutomaticJobTypes = [2]Type{TypeAutoCreateStats, TypeAutoSpanConfigReconciliation}

nit: s/2/.../ to avoid needing to update this length each time a new element is added.
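I.e., something like the following, letting the compiler infer the array length:

// AutomaticJobTypes is a list of automatic job types that currently exist.
var AutomaticJobTypes = [...]Type{TypeAutoCreateStats, TypeAutoSpanConfigReconciliation}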


pkg/spanconfig/spanconfigmanager/manager.go, line 122 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

I changed TestManagerJobCreation to look like what you described above. Let me know what you think

I think it looks good. Thanks.


pkg/spanconfig/spanconfigmanager/manager.go, line 65 at r5 (raw file):

Previously, ajwerner wrote…

I'm not sure that this synchronous method is really that helpful. What if we just had this thing sit in a loop and periodically, like every hour or something (after startup) go and run the check. If you need a synchronous hook, you can have it poke the loop or something. My guess is that the synchronous thing is handy in testing. Something smells about it being called in preStart.

The synchronous nature of this method came from a previous comment. This was async back in r2, though still only single-shot. If we want to run this work inside of an async loop then I'd encourage us to push that outside of this API. Interfaces that are internally asynchronous are harder to test and harder to reason about. It means that we can't give any information back to the caller, like whether or not the job started. If a caller wants to call a synchronous method in an async task, it is free to.


pkg/spanconfig/spanconfigmanager/manager_test.go, line 71 at r5 (raw file):

				require.Equal(t, job.Payload().Description, "reconciling span configurations")
			},
			ManagerAfterCheckedReconciliationJobExistsInterceptor: func(exists bool) {

We're going to call this twice during the first attempt, right? Because we expect a transaction retry. If so, we should assert that this is called exactly 3 times and that the first two passes exists=false and the third passes exists=true.


pkg/spanconfig/spanconfigmanager/manager_test.go, line 101 at r5 (raw file):

		// Only try to start the job once the main testing goroutine has reached the
		// blocker and is waiting.
		for {

This loop would be better served using a channel or a condition variable. Some way to avoid spinning on the mutex.

Maybe this should <-blocker and the other goroutine in the hook should close(blocker) inside of a sync.Once.


pkg/sql/delegate/show_jobs.go, line 59 at r5 (raw file):

				predicate.WriteString("'")
			}
			predicate.WriteString(")")

Can we replace some of these single-character strings with predicate.WriteByte? That is marginally more efficient.

Contributor

@ajwerner ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, @nvanbenschoten, and @RichardJCai)


pkg/spanconfig/spanconfigmanager/manager.go, line 65 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The synchronous nature of this method came from a previous comment. This was async back in r2, though still only single-shot. If we want to run this work inside of an async loop then I'd encourage us to push that outside of this API. Interfaces that are internally asynchronous are harder to test and harder to reason about. It means that we can't give any information back to the caller, like whether or not the job started. If a caller wants to call a synchronous method in an async task, it is free to.

I get the desire to have the logic be implemented with a synchronous call somewhere for the reasons you indicate. What I'm pushing back against is the idea that we want that synchronous call to happen during server startup 100% of the time. Do we really want to scan the jobs table synchronously on server startup?

Member

@nvanbenschoten nvanbenschoten left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, and @RichardJCai)


pkg/spanconfig/spanconfigmanager/manager.go, line 65 at r5 (raw file):

Previously, ajwerner wrote…

I get the desire to have the logic be implemented with a synchronous call somewhere for the reasons you indicate. What I'm pushing back against is the idea that we want that synchronous call to happen during server startup 100% of the time. Do we really want to scan the jobs table synchronously on server startup?

What is the concern with this running on server startup? SQL migrations also run synchronously on server startup. Do you expect this to be expensive because we don't index on job_type so this results in a full table scan? If so, that's a fair concern. And beyond this, if we need to periodically check if the job exists, then we'll need some task to do that anyway, so there's no need to also perform this synchronously at startup.

Contributor

@ajwerner ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, @nvanbenschoten, and @RichardJCai)


pkg/spanconfig/spanconfigmanager/manager.go, line 65 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What is the concern with this running on server startup? SQL migrations also run synchronously on server startup. Do you expect this to be expensive because we don't index on job_type so this results in a full table scan? If so, that's a fair concern. And beyond this, if we need to periodically check if the job exists, then we'll need some task to do that anyway, so there's no need to also perform this synchronously at startup.

The scan concerns me, yes. Imagine that the jobs table is unavailable because this node cannot start up. I guess we have that problem for startup migrations too. Fortunately those things are (slowly) on their way out. I filed #68896 to index the type because I think generally it's a good idea. Whether we like it or not, there's a movement coming in the 22.1 time frame to cut the number of synchronous, global round trips during sql server startup to a minimum. I'm trying to get ahead of that.

I think having a periodic task to create the job is a good idea. I have fears of the job failing for various reasons and I have fears about people deleting it and everything breaking.

Contributor

@ajwerner ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, @nvanbenschoten, and @RichardJCai)


pkg/server/server_sql.go, line 1019 at r5 (raw file):

Previously, ajwerner wrote…

Also, doesn't this need version gating? We can't run this until after we've activated a version where all of the nodes know about this job type.

I was wrong. preStart is effectively start for the sql server.


pkg/spanconfig/spanconfigmanager/manager.go, line 65 at r5 (raw file):

Previously, ajwerner wrote…

The scan concerns me, yes. Imagine that the jobs table is unavailable because this node cannot start up. I guess we have that problem for startup migrations too. Fortunately those things are (slowly) on their way out. I filed #68896 to index the type because I think generally it's a good idea. Whether we like it or not, there's a movement coming in the 22.1 time frame to cut the number of synchronous, global round trips during sql server startup to a minimum. I'm trying to get ahead of that.

I think having a periodic task to create the job is a good idea. I have fears of the job failing for various reasons and I have fears about people deleting it and everything breaking.

Err ignore me about the startup, this is after we start up the kvserver.

Contributor

@irfansharif irfansharif left a comment

Is the periodic job creation attempt the remaining blocker for this PR?

Reviewed 1 of 26 files at r3, 12 of 25 files at r4, 1 of 3 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @nvanbenschoten, and @RichardJCai)


pkg/jobs/registry_test.go, line 128 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What do you mean by "noise" here? Does the test fail if we don't disable the job creation?

Yea, I'm still not sure why we need it (if we do).


pkg/server/admin_test.go, line 1418 at r5 (raw file):

	rows := db.Query(
		t,
		fmt.Sprintf(

Do we need the fmt.Sprintf? db.Query can parametrize the query string with arguments.


pkg/server/admin_test.go, line 1508 at r5 (raw file):

		{
			"jobs",
			append(append([]int64{5, 4, 3, 2, 1}, existingSucceededIDs...), existingRunningIDs...),

This is "all jobs" right? Could instead define a expectedIDs = existingSucceededIDs + existingRunningIDs above.


pkg/server/server_sql.go, line 804 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Done.

"Instantiate a span config manager; it exposes a hook to idempotently create the span config reconciliation job and captures all relevant job dependencies."


pkg/server/server_sql.go, line 1019 at r5 (raw file):

Previously, ajwerner wrote…

I was wrong. preStart is effectively start for the sql server.

Looks like it fired up an async task internally, though not obvious looking at the signature. +1 to pulling up the async task + forever attempt to keep creating using a synchronous API here at the caller.


pkg/spanconfig/testing_knobs.go, line 18 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Done.

"ManagerDisableJobCreation disables creating the auto span config reconciliation job."


pkg/spanconfig/spanconfigjob/job.go, line 36 at r5 (raw file):

Previously, ajwerner wrote…

Is the idea that the implementation of reconciliation will happen here or is is the idea that there will be another injected piece of machinery that gets invoked?

The latter, one of the job dependencies will be simply emitting span config changes it observes through SQL. The job exists to simply issue RPCs to KV propagating those changes. It'll also checkpoint internally to make sure when resumed, it's doing the full sync.


pkg/spanconfig/spanconfigmanager/manager.go, line 65 at r5 (raw file):

Previously, ajwerner wrote…

Err ignore me about the startup, this is after we start up the kvserver.

Are you saying that we want one instance of the job running at a time, rather than only one job being created that runs forever? Makes sense, a small enough change, but mind explaining why that's preferable? Is it just safeguarding in case we accidentally write code resulting in the forever job failing/entering a terminal state, or us somehow purging the job away? Cool -- makes it a lot less fragile. We did in fact have bugs in our prototype where we returned errors in the job, and this makes it a lot more robust.


pkg/spanconfig/spanconfigmanager/manager_test.go, line 71 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Changed this test around to be a bit more deterministic, lemme know what you think.

Now that StartJobIfNoneExists returns whether or not it was a noop, can we simplify the test to check that only one of the concurrent attempts returns true? We could then validate by scanning the jobs table to make sure only one instance exists. I think it should obviate the need for the new testing interceptor and all these extra channels, etc.

Contributor

@ajwerner ajwerner left a comment

Is the periodic job creation attempt the remaining blocker for this PR?

Also version gating.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, @nvanbenschoten, and @RichardJCai)


pkg/spanconfig/spanconfigjob/job.go, line 36 at r5 (raw file):

Previously, irfansharif (irfan sharif) wrote…

The latter, one of the job dependencies will be simply emitting span config changes it observes through SQL. The job exists to simply issue RPCs to KV propagating those changes. It'll also checkpoint internally to make sure when resumed, it's doing the full sync.

Okay, I'll wait for the next PR to talk about what functionality lives where.

@arulajmani arulajmani requested a review from a team as a code owner August 13, 2021 23:29
@blathers-crl blathers-crl bot requested a review from irfansharif August 13, 2021 23:29
Contributor

@irfansharif irfansharif left a comment

:lgtm:

Reviewed 6 of 12 files at r8, 32 of 32 files at r9, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @Azhng, @irfansharif, @nvanbenschoten, and @RichardJCai)


-- commits, line 14 at r9 (raw file):
Typo in commit message (idempootently).


pkg/jobs/registry_test.go, line 128 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

This, and other tests in this file, directly modify system.jobs and then make assertions over the contents of it. For example, TestRegistryGCPagination ensures that old terminal jobs are gc-ed away by asserting no entries are left in system.jobs. I could modify the test to add a special case for this auto span config reconciliation job or change the calls to use SHOW JOBS instead of querying system.jobs directly, but I prefer disabling the job entirely considering these tests shouldn't need to bother with the nuances of this auto job.

It's worth improving the comment here to say as much (and capitalize "disable").


pkg/server/server_sql.go, line 1037 at r9 (raw file):

			nextTick := timeutil.Now()
			for {
				nextTickCh := time.After(nextTick.Sub(timeutil.Now()))

Use time.NewTicker instead. Also, should we attach a SetOnChange hook on the cluster setting to attempt job creation when changed? If the default duration is long, it could come in handy when trying to quickly bounce it. For upgrading clusters it's possible for the reconciliation job to only get started 10m (or whatever the setting ends up being) after the upgrade has finalized. Is that what we want? I'm not sure what we could do -- maybe a similar cluster setting changed hook for the version setting? Or maybe a "migration" that also tries to idempotently call into this job creation hook?


pkg/server/server_sql.go, line 1043 at r9 (raw file):

					// cluster version allows for it.
					if s.execCfg.Settings.Version.IsActive(ctx, clusterversion.AutoSpanConfigReconciliationJob) {
						_ = s.execCfg.StartSpanConfigReconciliationJobHook(ctx)

Return + log the error.


pkg/server/server_sql.go, line 1051 at r9 (raw file):

				}
				nextTick = nextTick.Add(
					spanconfig.CheckAndStartReconciliationJobInterval.Get(&s.execCfg.Settings.SV),

Given that this cluster setting is only used in this package, should we move it here instead? Seems like a property of the SQL server, rather than the span config manager.


pkg/server/server_sql.go, line 1053 at r9 (raw file):

					spanconfig.CheckAndStartReconciliationJobInterval.Get(&s.execCfg.Settings.SV),
				)
				log.Infof(

Remove this log message.


pkg/spanconfig/spanconfigmanager/manager.go, line 64 at r9 (raw file):

	started, err := m.createAndStartJobIfNoneExists(ctx)
	if err != nil {
		log.Errorf(ctx, "error starting auto span config reconciliation job: %v", err)

Return the error to the caller now that this isn't firing off an async task.

Contributor

@ajwerner ajwerner left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @Azhng, @irfansharif, @nvanbenschoten, and @RichardJCai)


pkg/server/server_sql.go, line 1033 at r9 (raw file):

Quoted 5 lines of code…
	// Start a background task that starts the auto span config reconciliation job.
	// The background task also periodically (as dictated by a cluster setting)
	// checks to ensure that the job exists. We don't expect this to happen, but
	// if does, we want to start it again.
	if err := s.stopper.RunAsyncTask(ctx, "start-span-config-reconciliation-job",

This feels out of place here. Why burden the server package with this logic? Can you instead hide this behind a Start method on the Manager struct.


pkg/server/server_sql.go, line 1037 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Use time.NewTicker instead. Also, should we attach a SetOnChange hook on the cluster setting to attempt job creation when changed? If the default duration is long, it could come in handy when trying to quickly bounce it. For upgrading clusters it's possible for the reconciliation job to only going to started 10m (or whatever the setting ends up being) after the upgrade has finalized. Is that what we want? I'm not sure what we could do -- maybe a similar cluster setting changed hook for the version setting? Or maybe a "migration" that also tries to idempotently call into this job creation hook?

I prefer a timeutil.Timer over either of those. It's a bit bulkier but strictly more flexible than either.
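For reference, a minimal sketch of the timeutil.Timer pattern (assuming the pkg/util/timeutil contract where Read is set to true after receiving on C); names mirror the r9 revision, and the interval is re-read from the cluster setting on every iteration:

timer := timeutil.NewTimer()
defer timer.Stop()
timer.Reset(0) // fire immediately on startup
for {
	select {
	case <-timer.C:
		timer.Read = true
		// ... check the cluster version and idempotently create/start the job ...
		timer.Reset(spanconfig.CheckAndStartReconciliationJobInterval.Get(&s.execCfg.Settings.SV))
	case <-s.stopper.ShouldQuiesce():
		return
	case <-ctx.Done():
		return
	}
}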


pkg/server/server_sql.go, line 1051 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Given that this cluster setting is only used in this package, should we move it here instead? Seems like a property of the SQL server, rather than the span config manager.

I feel more or less the opposite. Can we stop referencing it in the server package?

Collaborator Author

@arulajmani arulajmani left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @Azhng, @irfansharif, @nvanbenschoten, and @RichardJCai)


pkg/server/server_sql.go, line 1033 at r9 (raw file):

Previously, ajwerner wrote…
	// Start a background task that starts the auto span config reconciliation job.
	// The background task also periodically (as dictated by a cluster setting)
	// checks to ensure that the job exists. We don't expect this to happen, but
	// if does, we want to start it again.
	if err := s.stopper.RunAsyncTask(ctx, "start-span-config-reconciliation-job",

This feels out of place here. Why burden the server package with this logic? Can you instead hide this behind a Start method on the Manager struct.

Maybe I missed something, but from the discussion around synchronous vs. asynchronous APIs I was under the impression that we wanted the caller of this method to be responsible for establishing the async task (as opposed to the async task being part of the API). Is that not what we want?

Contributor

@ajwerner ajwerner left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @Azhng, @irfansharif, @nvanbenschoten, and @RichardJCai)


pkg/server/server_sql.go, line 1033 at r9 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Maybe I missed something, but from the discussion around synchronous vs. asynchronous API's I was under the impression that we wanted the caller of this method to be responsible for establishing the async task (as opposed to the async task being part of the API). Is that not what we want?

What @nvanbenschoten said regarding testability was right. You did add a synchronous API. What I'm hanging on isn't how you structure the async part of this, whether it be some function you call or some struct that you start. What I'm hanging on is where that logic lives. The server layer here should be an integration and dependency injection layer. It can have business logic, sure, but that business logic should be about setting up the server given the setting, not as much the concerns of some subsystem itself.

@nvanbenschoten how far off on outlook are we in terms of what this integration should look like?

Collaborator Author

@arulajmani arulajmani left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @Azhng, @irfansharif, @nvanbenschoten, and @RichardJCai)


pkg/server/server_sql.go, line 1033 at r9 (raw file):

What I'm hanging on isn't how you structure the async part of this, whether it be some function you call or some struct that you start. What I'm hanging on is where that logic lives.

Maybe I misunderstood the ask to hide this behind a Start method on Manager. I was assuming that the stopper.RunAsyncTask would be part of the Start method. Are you instead suggesting putting just the periodic cluster version check + idempotent job queueing in the Start method and still calling that here inside stopper.RunAsyncTask?

Member

@nvanbenschoten nvanbenschoten left a comment

Reviewed 2 of 12 files at r8, 32 of 32 files at r9, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @Azhng, @irfansharif, @nvanbenschoten, and @RichardJCai)


pkg/jobs/registry_test.go, line 128 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

It's worth improving the comment here to say as much (and capitalize "disable").

Agreed. Spell that all out. Otherwise, it will be impossible for anyone to ever convince themselves it's safe to remove this knob in the future, even if parts of the system change.


pkg/server/server_sql.go, line 1033 at r9 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

What I'm hanging on isn't how you structure the async part of this, whether it be some function you call or some struct that you start. What I'm hanging on is where that logic lives.

Maybe I misunderstood the ask to hide this behind a Start method on Manager. I was assuming that the stopper.RunAsyncTask would be part of the Start method. Are you instead suggesting putting just the periodic cluster version check + idempotent job queueing in the Start method and still calling that here inside stopper.RunAsyncTask?

I think what Andrew is saying is that the async task and the business logic of this retry loop should live inside of the spanconfig package (or really anywhere else but server). We removed the asynchrony from the Manager.StartJobIfNoneExists call to make the inner transaction easier to test, but that doesn't mean the retry logic needs to be fully externalized. We just wanted to split out the core logic in StartJobIfNoneExists from the task that was orchestrating it. But that orchestration logic still belongs somewhere inside of the spanconfig layer.

The way to look at this is that you want to push async processing as far out as possible, but no further.

It would be very reasonable to add a (*spanconfigmanager.Manager).Start method that launches an async task and calls (*spanconfigmanager.Manager).startJobIfNoneExists in this loop but then to test startJobIfNoneExists directly. That's kind of what you originally had, so maybe I should have just advised you to call createAndStartJobIfNoneExist directly from your unit tests.


pkg/spanconfig/spanconfigmanager/manager_test.go, line 101 at r5 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Can't use blocker because I'm reassigning it to unblocker and closing it to signal back to the first goroutine, but I removed this loop and replaced it with a different channel. I don't think I need the sync.Once either because of the way the test is set up.

This still seems overly complex to me. First off, why do we need the mutex? Also, the blocker/unblocker/isBlocked stuff is tough to read and reason about. There's too much state. I think I see why you need two channels, but not the third. I imagine this could be simplified with the following sync primitives:

var firstCheck sync.Once
blockFirstC := make(chan struct{})
unblockFirstC := make(chan struct{})

...
ManagerAfterCheckedReconciliationJobExistsInterceptor: func(exists bool) {
    firstCheck.Do(func() {
      close(blockFirstC)
      <-unblockFirstC
    })
}

...
go func() {
    defer wg.Done()
    <-blockFirstC
    ...
    close(unblockFirstC)
}()

Also, nit, but consider using errgroup and then returning these t.Errorf as errors. Then you can require.NoError(t, wg.Wait()) at the end.
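A short sketch of the errgroup shape (golang.org/x/sync/errgroup); the goroutine body is elided, and the channel names are the ones from the sketch above:

var g errgroup.Group
g.Go(func() error {
	<-blockFirstC
	// ... same body as before, returning errors instead of calling t.Errorf ...
	close(unblockFirstC)
	return nil
})
// additional g.Go(...) calls as needed
require.NoError(t, g.Wait())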

Collaborator Author

@arulajmani arulajmani left a comment

RFAL!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @arulajmani, @Azhng, @irfansharif, @nvanbenschoten, and @RichardJCai)


pkg/jobs/registry_test.go, line 128 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Agreed. Spell that all out. Otherwise, it will be impossible for anyone to ever convince themselves it's safe to remove this knob in the future, even if parts of the system change.

Done.


pkg/server/server_sql.go, line 1033 at r9 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think what Andrew is saying is that the async task and the business logic of this retry loop should live inside of the spanconfig package (or really anywhere else but server). We removed the asynchrony from the Manager.StartJobIfNoneExists call to make the inner transaction easier to test, but that doesn't mean the retry logic needs to be fully externalized. We just wanted to split out the core logic in StartJobIfNoneExists from the task that was orchestrating it. But that orchestration logic still belongs somewhere inside of the spanconfig layer.

The way to look at this is that you want to push async processing as far out as possible, but no further.

It would be very reasonable to add a (*spanconfigmanager.Manager).Start method that launches an async task and calls (*spanconfigmanager.Manager).startJobIfNoneExists in this loop but then to test startJobIfNoneExists directly. That's kind of what you originally had, so maybe I should have just advised you to call createAndStartJobIfNoneExist directly from your unit tests.

I confused myself a bit here, my bad. I think I have it now and this thing looks much nicer! Lemme know what y'all think


pkg/server/server_sql.go, line 1037 at r9 (raw file):

Previously, ajwerner wrote…

I prefer a timeutil.Timer over either of those. It's a bit bulkier but strictly more flexible than either.

Done.


pkg/server/server_sql.go, line 1043 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Return + log the error.

I don't think we want to return here, do we? We probably want to try again instead of turning the whole subsystem off, right?


pkg/server/server_sql.go, line 1051 at r9 (raw file):

Previously, ajwerner wrote…

I feel more or less the opposite. Can we stop referencing it in the server package?

Done.


pkg/server/server_sql.go, line 1053 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Remove this log message.

Done.


pkg/spanconfig/spanconfigmanager/manager.go, line 64 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Return the error to the caller now that this isn't firing off an async task.

Done.

Member

@nvanbenschoten nvanbenschoten left a comment

:lgtm:

Reviewed 2 of 34 files at r10, 32 of 32 files at r11, 1 of 1 files at r12, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @arulajmani, @Azhng, @irfansharif, and @RichardJCai)


pkg/spanconfig/spanconfigmanager/manager.go, line 85 at r12 (raw file):

			lastChecked := time.Time{}
			timer := timeutil.NewTimer()

defer timer.Stop()


pkg/spanconfig/spanconfigmanager/manager_test.go, line 81 at r12 (raw file):

					close(isBlocked)
					<-unblocker
				} else {

This else branch seems redundant. It's also pretty subtle that the count changes between when we first compare/increment and later on. Did you consider a structure along the lines of:

count++
curCount := count
if curCount == 1 {
    close(isBlocked)
    <-unblocker
}
...
if curCount <= 2 {
    ...

Contributor

@ajwerner ajwerner left a comment

Just the stuff Nathan pointed out and nits. Also needs a rebase. It will be ready to go in today.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @arulajmani, @Azhng, @irfansharif, and @RichardJCai)


pkg/spanconfig/spanconfigmanager/manager.go, line 74 at r12 (raw file):

func (m *Manager) Start(ctx context.Context) error {
	return m.stopper.RunAsyncTask(ctx, "start-span-config-reconciliation-job",
		func(ctx context.Context) {

nit: buy yourself two layers of indentation by making a method on the Manager called run that takes a context and just pass it here.
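Sketched out, assuming stopper.RunAsyncTask's func(context.Context) callback shape, the nit amounts to:

func (m *Manager) Start(ctx context.Context) error {
	return m.stopper.RunAsyncTask(ctx, "start-span-config-reconciliation-job", m.run)
}

// run hosts the periodic check-and-start loop at one less level of indentation.
func (m *Manager) run(ctx context.Context) {
	// ... timer loop ...
}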


pkg/spanconfig/spanconfigmanager/manager.go, line 178 at r12 (raw file):

Quoted 8 lines of code…
	stmt := fmt.Sprintf(`
SELECT EXISTS(
         SELECT job_id
           FROM [SHOW AUTOMATIC JOBS]
          WHERE job_type = '%s'
					AND status IN ('%s', '%s', '%s')
       );
`, jobspb.TypeAutoSpanConfigReconciliation.String(), jobs.StatusRunning, jobs.StatusPaused, jobs.StatusPending)

for the statuses how about use jobs.NonTerminalStatusTupleString?

This patch creates a new job type intended for span config
reconciliation. This job is hidden from the output of SHOW JOBS as it
is intended to be an AUTOMATIC job.

Release note: None
Collaborator Author

@arulajmani arulajmani left a comment

Rebased. I was speaking to @irfansharif offline and he mentioned that instead of defining a spanconfig.Manager interface we should just use the spanconfigmanager.Manager struct in server_sql.go -- the new revision includes that change as well.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @ajwerner, @arulajmani, @Azhng, @irfansharif, @nvanbenschoten, and @RichardJCai)

Contributor

@ajwerner ajwerner left a comment

:lgtm:, on to the meatier PRs

I was speaking to @irfansharif offline and he mentioned that instead of defining a spanconfig.Manager interface we should just use the spanconfigmanager.Manager struct in server_sql.go -- the new revision includes that change as well.

I like that very much.

Reviewed 1 of 35 files at r13, 7 of 33 files at r14.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @arulajmani, @irfansharif, @nvanbenschoten, and @RichardJCai)


pkg/spanconfig/spanconfig.go, line 31 at r14 (raw file):

type Manager interface {
	Start(ctx context.Context) error

Now this can be removed from the interface.


pkg/spanconfig/spanconfigmanager/manager.go, line 188 at r14 (raw file):

Quoted 8 lines of code…

SELECT EXISTS(
         SELECT job_id
           FROM [SHOW AUTOMATIC JOBS]
          WHERE job_type = '%s'
					AND status IN %s
       );
`, jobspb.TypeAutoSpanConfigReconciliation.String(), jobs.NonTerminalStatusTupleString)

total nit: you can do this statically in a var.
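E.g., a sketch with the same query text, just hoisted into a package-level var:

var reconciliationJobExistsStmt = fmt.Sprintf(`
SELECT EXISTS(
         SELECT job_id
           FROM [SHOW AUTOMATIC JOBS]
          WHERE job_type = '%s'
            AND status IN %s
       );
`, jobspb.TypeAutoSpanConfigReconciliation.String(), jobs.NonTerminalStatusTupleString)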

This patch introduces the span config subsystem and its orchestrator,
the spanconfigmanager.Manager. The span config manager owns the process
of reconciling SQL zone configs to KV span configs. This includes
managing the span config reconciliation job and providing it access to
dependencies it needs to perform reconciliation.

The span config reconciliation job is intended to be a per tenant,
forever running job that is non-cancellable. At any point there should
be one (and only one) of these jobs running. The manager helps ensure
these semantics.

Even though we expect this job to be forever running (and therefore
never transition to a failed state), the manager on every sql pod
will periodically ensure that the job is indeed running. It will
start a new job if it finds that the job is not running. The
interval at which this happens is dictated by the
`spanconfig.reconciliation_job.start_interval` cluster
setting. It defaults to 10 minutes and is private for now.

This patch only adds the skeleton to encapsulate and plumb dependencies
to the job. The specific components themselves will come in subsequent
patches.

References cockroachdb#67679

Release note: None
Collaborator Author

@arulajmani arulajmani left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @ajwerner, @arulajmani, @irfansharif, @nvanbenschoten, and @RichardJCai)


pkg/spanconfig/spanconfig.go, line 31 at r14 (raw file):

Previously, ajwerner wrote…

Now this can be removed from the interface.

I meant to remove the interface entirely.

@arulajmani
Collaborator Author

Thanks for the reviews on this one.

bors r=ajwerner,nvanbenschoten, irfansharif

@craig
Contributor

craig bot commented Aug 18, 2021

Build succeeded:

@craig craig bot merged commit 9d76ce5 into cockroachdb:master Aug 18, 2021
@irfansharif
Contributor

🎉
