spanconfig: introduce the span config manager #68522
Conversation
Force-pushed 26bef00 to a3fbea8
LGTM, but I'll defer to others for the actual rubber stamp. My comments are mostly around adding more comments. I think it'd be very helpful to explain in text somewhere what various pieces we're looking to introduce over subsequent PRs and how we expect them to get stitched together. It'd make reviewing easier and faster.
To that end, I'd make the commit messages a bit more verbose -- I think there's too little context unless you have it all paged in already. For example, explain that this job runs per-tenant, and what it'll eventually be slated to do. Also, refer to #67679. We could alternatively create an issue containing all the details + pointers to relevant bits in the prototype branch, so we know the final state we're headed towards and how this intermediate PR gets us there.
Reviewed 5 of 5 files at r1, 23 of 23 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, and @nvanbenschoten)
pkg/jobs/registry_test.go, line 128 at r2 (raw file):
Knobs: base.TestingKnobs{
	SpanConfig: &spanconfig.TestingKnobs{
		ManagerDisableJobCreation: true,
Why are we disabling the span config manager in this test? Maybe add an inline comment.
pkg/server/admin_test.go, line 1437 at r2 (raw file):
Knobs: base.TestingKnobs{
	SpanConfig: &spanconfig.TestingKnobs{
		ManagerDisableJobCreation: true,
Ditto. Out of curiosity -- is there anything else that's non-invasive-ish that we can do to generalize these tests to not need to use this knob?
pkg/server/server_sql.go, line 804 at r2 (raw file):
{
	knobs, _ := cfg.TestingKnobs.SpanConfig.(*spanconfig.TestingKnobs)
	reconciliationMgr := spanconfigmanager.New(
Add some commentary here explaining how+why these things are stitched together.
pkg/spanconfig/spanconfig.go, line 17 at r2 (raw file):
// configs with the cluster's.
type ReconciliationDependencies interface {
	// KVAccessor
If you're making mention of future interfaces here, add some additional commentary. This is not providing the most helpful context I think.
pkg/spanconfig/testing_knobs.go, line 18 at r2 (raw file):
// for testing.
type TestingKnobs struct {
	ManagerDisableJobCreation bool
Add a comment, mention it's the auto reconciliation job that's not being created.
pkg/spanconfig/testing_knobs.go, line 24 at r2 (raw file):
}

// ModuleTestingKnobs makes TestingKnobs a base.ModuleTestingKnobs.
// ModuleTestingKnobs is part of the base.ModuleTestingKnobs interface.
pkg/spanconfig/spanconfigmanager/manager.go, line 28 at r2 (raw file):
)

// Manager is responsible for ensuring that only one (and only one) span config
"that one (and only one)"
pkg/spanconfig/spanconfigmanager/manager.go, line 41 at r2 (raw file):
var _ spanconfig.ReconciliationDependencies = &Manager{}

// New constructs a new reconciliation manager.
Drop the "reconciliation manager" language, I think this was a remnant from our earlier package structure with the zcfgreconciler guy.
pkg/spanconfig/spanconfigmanager/manager.go, line 61 at r2 (raw file):
}

// StartJobIfNoneExists will create and start the span config reconciliation job
s/Exists/Exist?
pkg/spanconfig/spanconfigmanager/manager_test.go, line 71 at r2 (raw file):
manager.StartJobIfNoneExists(ctx)
testutils.SucceedsSoon(t, func() error {
Is there a way to cleanly test that the concurrent attempts don't do anything? Perhaps making sure there are no outstanding stopper tasks. I think with the test as is it's possible to flake your way into success if it completes before the other StartJobIfNoneExists calls get to do anything.
pkg/sql/exec_util.go, line 1022 at r2 (raw file):
// StartSpanConfigReconciliationJobHook is called during the SQL server
// startup process, to idempotently create and start a span config
s/,/
s/reconciliation for the tenant if none exists/reconciliation job
pkg/sql/exec_util.go, line 1027 at r2 (raw file):
}

// StartSpanConfigReconciliationJobHook is used to create and start the span
// StartSpanConfigReconciliationJobHook is used to idempotently create and start the span config reconciliation job.
pkg/sql/delegate/show_jobs.go, line 66 at r1 (raw file):
sqlStmt := fmt.Sprintf("%s %s %s", selectClause, whereClause, orderbyClause)
if n.Block {
	// The AutoSpanConfigReconciliation job is never meant to complete so we
s/never meant/not meant
s/via the WHERE//
pkg/sql/delegate/show_jobs.go, line 67 at r1 (raw file):
if n.Block {
	// The AutoSpanConfigReconciliation job is never meant to complete so we
	// explicitly exclude it here via the WHERE clause. This is required because
I'm surprised we need this. What do you mean by not having control over what filter is being applied first? We're filtering out the automatic jobs in the subquery -- aren't we guaranteed to never see them here? Unless the user's trying SHOW AUTOMATIC JOBS WHEN COMPLETE, but even then -- do we want to be smart about it here and filter out an automatic job?
very fast first pass
Reviewed all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)
pkg/spanconfig/spanconfigmanager/manager.go, line 29 at r2 (raw file):
// Manager is responsible for ensuring that only one (and only one) span config
// reconciliation job is ever created. It also provides the job with all
I don't think "is ever created" is the right language. I would think instead it should be something like "is ever running".
pkg/spanconfig/spanconfigmanager/manager.go, line 83 at r2 (raw file):
}

var startableJob *jobs.StartableJob
I don't think StartableJob is what you're looking for. It's a bit of a hack to get some output into the caller's goroutine. I think you just want a regular ol' job.
Things mostly look good. I found myself agreeing with all of Irfan's comments and left a few of my own. It's exciting to see this!
Reviewed 5 of 5 files at r1, 23 of 23 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @irfansharif)
pkg/spanconfig/spanconfigmanager/manager.go, line 64 at r2 (raw file):
// iff it hadn't been created already.
func (m *Manager) StartJobIfNoneExists(ctx context.Context) {
	_ = m.stopper.RunAsyncTask(ctx, "create-and-start-reconciliation-job", func(ctx context.Context) {
Out of curiosity, is there a benefit to pulling all of createAndStartJobIfNoneExist into an async task instead of just the job in cases where it doesn't yet exist? I don't think there's a performance concern here, and it's usually easier to use an API that is synchronous for as long as possible. Irfan's comment on the test demonstrates this.
EDIT: beyond what I said here, if you pushed the async part back, you could also return a bool indicating whether the job started or not. This would make testing a breeze compared to what you're doing now.
pkg/spanconfig/spanconfigmanager/manager.go, line 122 at r2 (raw file):
ctx context.Context, txn *kv.Txn,
) (exists bool, _ error) {
	const stmt = `
This query is transactional all the way down to the scan of system.jobs, right? I believe so based on a reading of crdbInternalJobsTable, but we should confirm, because without that, the transaction in createAndStartJobIfNoneExist isn't enough to provide mutual exclusion and guarantee only one job running at a time.
Is this worth a test? For instance, you could spin up two calls to createAndStartJobIfNoneExist, let both get past this point and block in a testing hook, then let them proceed and assert that only one succeeds and the other txn restarts and notices the job. I guess what I'm asking for is a more deterministic version of TestManagerJobCreation that doesn't rely on a lucky race to reveal bugs.
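To make the mutual-exclusion point concrete, a minimal sketch (checkJobExists and writeJobRecord are hypothetical helpers standing in for the PR's actual ones): both the existence check and the job-record write run under the same transaction, so two concurrent attempts can't both observe "no job" and both commit.
err := db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
	// The existence check must read system.jobs using this txn; otherwise the
	// txn doesn't conflict with a concurrent creator and provides no exclusion.
	exists, err := checkJobExists(ctx, txn)
	if err != nil {
		return err
	}
	if exists {
		return nil // someone else already created the job; nothing to do
	}
	// Write the job record using the same txn; a concurrent writer forces a
	// restart, after which the check above observes the other job.
	return writeJobRecord(ctx, txn)
})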
pkg/spanconfig/spanconfigmanager/manager.go, line 126 at r2 (raw file):
SELECT job_id FROM [SHOW AUTOMATIC JOBS] WHERE job_type = 'AUTO SPAN CONFIG RECONCILIATION'
Do we want to hard code this 'AUTO SPAN CONFIG RECONCILIATION' string? Should it come from a const defined somewhere?
pkg/sql/planner.go, line 542 at r2 (raw file):
} // SpanConfigReconciliationJobDeps returns the spanconfig.ReconciliationJobDeps
nit: missing period.
pkg/sql/delegate/show_jobs.go, line 43 at r1 (raw file):
// Display all [only automatic] jobs without selecting specific jobs.
if n.Automatic {
	typePredicate = fmt.Sprintf("job_type = '%s' OR job_type = '%s'",
Is now the time to abstract this by pulling out some kind of jobspb.AutomaticJobs static array? Even if not, use IN here and NOT IN below to make this a little harder to get wrong.
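A rough sketch of that suggestion (the helper name is illustrative; the slice shows up later in this review as jobspb.AutomaticJobTypes): derive both branches of the predicate from one list so they can't drift apart.
// typePredicate builds the job_type filter from the shared list of automatic
// job types, quoting each enum's String() form for use in the SQL statement.
func typePredicate(automatic bool) string {
	quoted := make([]string, len(jobspb.AutomaticJobTypes))
	for i, typ := range jobspb.AutomaticJobTypes {
		quoted[i] = fmt.Sprintf("'%s'", typ)
	}
	list := strings.Join(quoted, ", ")
	if automatic {
		return fmt.Sprintf("job_type IN (%s)", list)
	}
	return fmt.Sprintf("job_type NOT IN (%s)", list)
}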
pkg/sql/delegate/show_jobs.go, line 67 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
I'm surprised we need this. What do you mean by not having control over what filter is being applied first? We're filtering out the automatic jobs in the subquery -- aren't we guaranteed to never see them here? Unless the user's trying SHOW AUTOMATIC JOBS WHEN COMPLETE, but even then -- do we want to be smart about it here and filter out an automatic job?
I had the same question. Isn't the filter above sufficient?
pkg/sql/logictest/testdata/logic_test/auto_span_config_reconciliation_job, line 20 at r2 (raw file):
# Ensure that the job is not cancelable.
statement error not cancelable
Is there a pgcode attached to this?
Force-pushed a3fbea8 to fadd9c8
RFAL!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @irfansharif, and @nvanbenschoten)
pkg/jobs/registry_test.go, line 128 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Why are we disabling the span config manager in this test? Maybe add an inline comment.
Done.
pkg/server/admin_test.go, line 1437 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Ditto. Out of curiosity -- is there anything else that's non-invasive-ish that we can do to generalize these tests to not need to use this knob?
Done.
pkg/server/server_sql.go, line 804 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Add some commentary here explaining how+why these things are stitched together.
Done.
pkg/spanconfig/spanconfig.go, line 17 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
If you're making mention of future interfaces here, add some additional commentary. This is not providing the most helpful context I think.
Done.
pkg/spanconfig/testing_knobs.go, line 18 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Add a comment, mention it's the auto reconciliation job that's not being created.
Done.
pkg/spanconfig/testing_knobs.go, line 24 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
// ModuleTestingKnobs is part of the base.ModuleTestingKnobs interface.
Done.
pkg/spanconfig/spanconfigmanager/manager.go, line 28 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
"that one (and only one)"
Done.
pkg/spanconfig/spanconfigmanager/manager.go, line 29 at r2 (raw file):
Previously, ajwerner wrote…
I don't think "is ever created" is the right language. I would think instead it should be something like "is ever running".
Done.
pkg/spanconfig/spanconfigmanager/manager.go, line 41 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Drop the "reconciliation manager" language, I think this was a remnant from our earlier package structure with the zcfgreconciler guy.
Done.
pkg/spanconfig/spanconfigmanager/manager.go, line 61 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
s/Exists/Exist?
"job" is singular, so it should be exists right?
pkg/spanconfig/spanconfigmanager/manager.go, line 64 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Out of curiosity, is there a benefit to pulling all of createAndStartJobIfNoneExist into an async task instead of just the job in cases where it doesn't yet exist? I don't think there's a performance concern here, and it's usually easier to use an API that is synchronous for as long as possible. Irfan's comment on the test demonstrates this. EDIT: beyond what I said here, if you pushed the async part back, you could also return a bool indicating whether the job started or not. This would make testing a breeze compared to what you're doing now.
I changed a few things around here; most notably, I switched from a StartableJob to a regular job based on Andrew's comment. There isn't much to do after we've written to system.jobs for regular jobs. I'm notifying the registry to pick this new job up, but from what I understand this is merely a nudge and not required.
This NotifyToAdoptJobs isn't blocking right now, but it could be in the future as it sends things on an unbuffered channel. As such, I have this part in an async task for now. Is this overkill? Should I just make the whole thing synchronous?
pkg/spanconfig/spanconfigmanager/manager.go, line 122 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This query is transactional all the way down to the scan of system.jobs, right? I believe so based on a reading of crdbInternalJobsTable, but we should confirm, because without that, the transaction in createAndStartJobIfNoneExist isn't enough to provide mutual exclusion and guarantee only one job running at a time.
Is this worth a test? For instance, you could spin up two calls to createAndStartJobIfNoneExist, let both get past this point and block in a testing hook, then let them proceed and assert that only one succeeds and the other txn restarts and notices the job. I guess what I'm asking for is a more deterministic version of TestManagerJobCreation that doesn't rely on a lucky race to reveal bugs.
I changed TestManagerJobCreation to look like what you described above. Let me know what you think.
pkg/spanconfig/spanconfigmanager/manager_test.go, line 71 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Is there a way to cleanly test that the concurrent attempts don't do anything? Perhaps making sure there are no outstanding stopper tasks. I think with the test as is it's possible to flake your way into success if it completes before the other StartJobIfNoneExists calls get to do anything.
Changed this test around to be a bit more deterministic, lemme know what you think.
pkg/sql/exec_util.go, line 1022 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
s/,/
s/reconciliation for the tenant if none exists/reconciliation job
Done.
pkg/sql/exec_util.go, line 1027 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
// StartSpanConfigReconciliationJobHook is used to idempotently create and start the span config reconciliation job.
Done.
pkg/sql/delegate/show_jobs.go, line 43 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is now the time to abstract this by pulling out some kind of jobspb.AutomaticJobs static array? Even if not, use IN here and NOT IN below to make this a little harder to get wrong.
Done
pkg/sql/delegate/show_jobs.go, line 66 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
s/never meant/not meant
s/via the WHERE//
No longer applicable.
pkg/sql/delegate/show_jobs.go, line 67 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I had the same question. Isn't the filter above sufficient?
tldr: this is no longer a problem after @ajwerner's changes.
This codepath corresponds to a SHOW JOBS WHEN COMPLETE <select_stmt>, where the select statement returns a list of job IDs to wait for. This select statement is used to inform the whereClause above. The way this query was written before, the optimizer could decide to reorder the filters, producing a plan like:
• hash join (semi)
│ equality: (job_id) = (?column?)
│ right cols are key
│
├── • filter
│ │ filter: CASE finished IS NULL WHEN true THEN CASE pg_sleep(1.0) WHEN true THEN crdb_internal.force_retry('24:00:00') ELSE 0 END ELSE 0 END = 0
│ │
│ └── • virtual table
│ table: jobs@primary
│
└── • union
│ estimated row count: 2
│
├── • values
│ size: 1 column, 1 row
│
└── • values
size: 1 column, 1 row
So this query could end up getting stuck in a constant retry loop. With @ajwerner's changes that rework this to use a CTE, this can no longer happen.
pkg/sql/logictest/testdata/logic_test/auto_span_config_reconciliation_job, line 20 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is there a pgcode attached to this?
The jobs subsystem doesn't return one when trying to cancel a non-cancellable job.
Force-pushed 9fe366a to 0db6ece
I think we need to get some people in a room and make decisions on how to deal with this job stuff.
// OnFailOrCancel implements the jobs.Resumer interface.
func (r *resumer) OnFailOrCancel(context.Context, interface{}) error {
	return errors.AssertionFailedf("span config reconciliation job can never fail or be canceled")
💵 says we get this sentry report before the end of 2022.
actually, I don't know how these things end up getting reported. Maybe no report because the jobs system doesn't send them.
var job *jobs.Job
if err := m.db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
	exists, err := m.checkIfReconciliationJobExists(ctx, txn)
pkg/server/server_sql.go (outdated)
@@ -996,6 +1015,9 @@ func (s *SQLServer) preStart(
	return err
}

// Create and start the span config reconciliation job if none exist.
_ = s.execCfg.StartSpanConfigReconciliationJobHook(ctx)
this is a blocking call, I don't know that it belongs in preStart here.
// StartJobIfNoneExists will create and start the span config reconciliation job
// iff it hadn't been created already. Returns a boolean indicating if the job
// was started.
func (m *Manager) StartJobIfNoneExists(ctx context.Context) bool {
I'm not sure that this synchronous method is really that helpful. What if we just had this thing sit in a loop and periodically (say, every hour or so after startup) go and run the check? If you need a synchronous hook, you can have it poke the loop or something. My guess is that the synchronous thing is handy in testing. Something smells about it being called in preStart.
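Roughly the shape being suggested, as a sketch (the poke channel, task name, and interval are made up for illustration; createAndStartJobIfNoneExists is assumed to report an error):
// The loop re-checks periodically; a buffered channel lets callers request an
// immediate check ("poke the loop") without blocking.
checkNowC := make(chan struct{}, 1)
_ = stopper.RunAsyncTask(ctx, "reconciliation-job-check-loop", func(ctx context.Context) {
	ticker := time.NewTicker(time.Hour)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
		case <-checkNowC:
		case <-stopper.ShouldQuiesce():
			return
		}
		if _, err := m.createAndStartJobIfNoneExists(ctx); err != nil {
			log.Errorf(ctx, "failed to ensure reconciliation job exists: %v", err)
		}
	}
})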
// TODO(zcfg-pod): Upcoming PRs will actually make use of these reconciliation
// dependencies.
_ = rc
Is the idea that the implementation of reconciliation will happen here, or is the idea that there will be another injected piece of machinery that gets invoked?
pkg/server/server_sql.go (outdated)
@@ -996,6 +1015,9 @@ func (s *SQLServer) preStart(
	return err
}

// Create and start the span config reconciliation job if none exist.
_ = s.execCfg.StartSpanConfigReconciliationJobHook(ctx)
Also, doesn't this need version gating? We can't run this until after we've activated a version where all of the nodes know about this job type.
Reviewed 2 of 26 files at r3, 24 of 25 files at r4, 3 of 3 files at r5, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, and @RichardJCai)
pkg/jobs/registry_test.go, line 128 at r2 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Done.
What do you mean by "noise" here? Does the test fail if we don't disable the job creation?
pkg/jobs/jobspb/wrap.go, line 69 at r5 (raw file):
// AutomaticJobTypes is a list of automatic job types that currently exist.
var AutomaticJobTypes = [2]Type{TypeAutoCreateStats, TypeAutoSpanConfigReconciliation}
nit: s/2/.../ to avoid needing to update this length each time a new element is added.
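For reference, the suggested spelling (a sketch of the nit, not the exact diff):
// With "..." the compiler infers the array length from the literal, so adding
// a new automatic job type doesn't require bumping a hard-coded size.
var AutomaticJobTypes = [...]Type{TypeAutoCreateStats, TypeAutoSpanConfigReconciliation}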
pkg/spanconfig/spanconfigmanager/manager.go, line 122 at r2 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
I changed TestManagerJobCreation to look like what you described above. Let me know what you think.
I think it looks good. Thanks.
pkg/spanconfig/spanconfigmanager/manager.go, line 65 at r5 (raw file):
Previously, ajwerner wrote…
I'm not sure that this synchronous method is really that helpful. What if we just had this thing sit in a loop and periodically, like every hour or something (after startup) go and run the check. If you need a synchronous hook, you can have it poke the loop or something. My guess is that the synchronous thing is handy in testing. Something smells about it being called in preStart.
The synchronous nature of this method came from a previous comment. This was async back in r2, though still only single-shot. If we want to run this work inside of an async loop then I'd encourage us to push that outside of this API. Interfaces that are internally asynchronous are harder to test and harder to reason about. It means that we can't give any information back to the caller, like whether or not the job started. If a caller wants to call a synchronous method in an async task, it is free to.
pkg/spanconfig/spanconfigmanager/manager_test.go, line 71 at r5 (raw file):
	require.Equal(t, job.Payload().Description, "reconciling span configurations")
},
ManagerAfterCheckedReconciliationJobExistsInterceptor: func(exists bool) {
We're going to call this twice during the first attempt, right? Because we expect a transaction retry. If so, we should assert that this is called exactly 3 times, and that the first two calls pass exists=false and the third passes exists=true.
pkg/spanconfig/spanconfigmanager/manager_test.go, line 101 at r5 (raw file):
// Only try to start the job once the main testing goroutine has reached the
// blocker and is waiting.
for {
This loop would be better served using a channel or a condition variable. Some way to avoid spinning on the mutex.
Maybe this should <-blocker and the other goroutine in the hook should close(blocker) inside of a sync.Once.
pkg/sql/delegate/show_jobs.go, line 59 at r5 (raw file):
	predicate.WriteString("'")
}
predicate.WriteString(")")
Can we replace some of these single-character strings with predicate.WriteByte? That is marginally more efficient.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, @nvanbenschoten, and @RichardJCai)
pkg/spanconfig/spanconfigmanager/manager.go, line 65 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The synchronous nature of this method came from a previous comment. This was async back in r2, though still only single-shot. If we want to run this work inside of an async loop then I'd encourage us to push that outside of this API. Interfaces that are internally asynchronous are harder to test and harder to reason about. It means that we can't give any information back to the caller, like whether or not the job started. If a caller wants to call a synchronous method in an async task, it is free to.
I get the desire to have the logic be implemented with a synchronous call somewhere for the reasons you indicate. What I'm pushing back against is the idea that we want that synchronous call to happen during server startup 100% of the time. Do we really want to scan the jobs table synchronously on server startup?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, and @RichardJCai)
pkg/spanconfig/spanconfigmanager/manager.go, line 65 at r5 (raw file):
Previously, ajwerner wrote…
I get the desire to have the logic be implemented with a synchronous call somewhere for the reasons you indicate. What I'm pushing back against is the idea that we want that synchronous call to happen during server startup 100% of the time. Do we really want to scan the jobs table synchronously on server startup?
What is the concern with this running on server startup? SQL migrations also run synchronously on server startup. Do you expect this to be expensive because we don't index on job_type, so this results in a full table scan? If so, that's a fair concern. And beyond this, if we need to periodically check if the job exists, then we'll need some task to do that anyway, so there's no need to also perform this synchronously at startup.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, @nvanbenschoten, and @RichardJCai)
pkg/spanconfig/spanconfigmanager/manager.go, line 65 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
What is the concern with this running on server startup? SQL migrations also run synchronously on server startup. Do you expect this to be expensive because we don't index on job_type, so this results in a full table scan? If so, that's a fair concern. And beyond this, if we need to periodically check if the job exists, then we'll need some task to do that anyway, so there's no need to also perform this synchronously at startup.
The scan concerns me, yes. Imagine that the jobs table is unavailable, and so this node cannot start up. I guess we have that problem for startup migrations too. Fortunately those things are (slowly) on their way out. I filed #68896 to index the type because I think generally it's a good idea. Whether we like it or not, there's a movement coming in the 22.1 time frame to cut the number of synchronous, global round trips during SQL server startup to a minimum. I'm trying to get ahead of that.
I think having a periodic task to create the job is a good idea. I have fears of the job failing for various reasons and I have fears about people deleting it and everything breaking.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, @nvanbenschoten, and @RichardJCai)
pkg/server/server_sql.go, line 1019 at r5 (raw file):
Previously, ajwerner wrote…
Also, doesn't this need version gating? We can't run this until after we've activated a version where all of the nodes know about this job type.
I was wrong. preStart is effectively start for the sql server.
pkg/spanconfig/spanconfigmanager/manager.go, line 65 at r5 (raw file):
Previously, ajwerner wrote…
The scan concerns me, yes. Imagine that the jobs table is unavailable because this node cannot start up. I guess we have that problem for startup migrations too. Fortunately those things are (slowly) on their way out. I filed #68896 to index the type because I think generally it's a good idea. Whether we like it or not, there's a movement coming in the 22.1 time frame to cut the number of synchronous, global round trips during sql server startup to a minimum. I'm trying to get ahead of that.
I think having a periodic task to create the job is a good idea. I have fears of the job failing for various reasons and I have fears about people deleting it and everything breaking.
Err ignore me about the startup, this is after we start up the kvserver.
Is the periodic job creation attempt the remaining blocker for this PR?
Reviewed 1 of 26 files at r3, 12 of 25 files at r4, 1 of 3 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @nvanbenschoten, and @RichardJCai)
pkg/jobs/registry_test.go, line 128 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
What do you mean by "noise" here? Does the test fail if we don't disable the job creation?
Yea, I'm still not sure why we need it (if we do).
pkg/server/admin_test.go, line 1418 at r5 (raw file):
rows := db.Query(
	t,
	fmt.Sprintf(
Do we need the fmt.Sprintf? db.Query can parametrize the query string with arguments.
pkg/server/admin_test.go, line 1508 at r5 (raw file):
{ "jobs", append(append([]int64{5, 4, 3, 2, 1}, existingSucceededIDs...), existingRunningIDs...),
This is "all jobs" right? Could instead define a expectedIDs = existingSucceededIDs + existingRunningIDs above.
pkg/server/server_sql.go, line 804 at r2 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Done.
"Instantiate a span config manager; it exposes a hook to idempotently create the span config reconciliation job and captures all relevant job dependencies."
pkg/server/server_sql.go, line 1019 at r5 (raw file):
Previously, ajwerner wrote…
I was wrong. preStart is effectively start for the sql server.
Looks like it fires up an async task internally, though that's not obvious from the signature. +1 to pulling the async task + forever attempt to keep creating up here to the caller, using a synchronous API underneath.
pkg/spanconfig/testing_knobs.go, line 18 at r2 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Done.
"ManagerDisableJobCreation disables creating the auto span config reconciliation job."
pkg/spanconfig/spanconfigjob/job.go, line 36 at r5 (raw file):
Previously, ajwerner wrote…
Is the idea that the implementation of reconciliation will happen here or is is the idea that there will be another injected piece of machinery that gets invoked?
The latter: one of the job dependencies will simply be emitting span config changes it observes through SQL. The job exists simply to issue RPCs to KV, propagating those changes. It'll also checkpoint internally to make sure that, when resumed, it's doing the full sync.
pkg/spanconfig/spanconfigmanager/manager.go, line 65 at r5 (raw file):
Previously, ajwerner wrote…
Err ignore me about the startup, this is after we start up the kvserver.
Are you saying that we want one instance of the job running at a time, rather than only one job being created that runs forever? Makes sense, a small enough change, but mind explaining why that's preferable? Is it just safeguarding in case we accidentally write code resulting in the forever job failing/entering a terminal state, or us somehow purging the job away? Cool -- makes it a lot less fragile. We did in fact have bugs in our prototype where we returned errors in the job, and this makes it a lot more robust.
pkg/spanconfig/spanconfigmanager/manager_test.go, line 71 at r2 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Changed this test around to be a bit more deterministic, lemme know what you think.
Now that StartJobIfNoneExists returns whether or not it was a noop, can we simplify the test to check that only one true is returned across the concurrent attempts? We could then validate by scanning the jobs table to make sure only one instance exists. I think it should obviate the need for the new testing interceptor and all these extra channels, etc.
Is the periodic job creation attempt the remaining blocker for this PR?
Also version gating.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, @nvanbenschoten, and @RichardJCai)
pkg/spanconfig/spanconfigjob/job.go, line 36 at r5 (raw file):
Previously, irfansharif (irfan sharif) wrote…
The latter, one of the job dependencies will be simply emitting span config changes it observes through SQL. The job exists to simply issue RPCs to KV propagating those changes. It'll also checkpoint internally to make sure when resumed, it's doing the full sync.
Okay, I'll wait for the next PR to talk about what functionality lives where.
Force-pushed 0db6ece to 31973cb
Force-pushed 1e57ee2 to f5d1d74
Reviewed 6 of 12 files at r8, 32 of 32 files at r9, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @Azhng, @irfansharif, @nvanbenschoten, and @RichardJCai)
-- commits, line 14 at r9 (raw file):
Typo in commit message (idempootently).
pkg/jobs/registry_test.go, line 128 at r2 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
This, and other tests in this file, directly modify system.jobs and then make assertions over the contents of it. For example, TestRegistryGCPagination ensures that old terminal jobs are gc-ed away by asserting no entries are left in system.jobs. I could modify the test to add a special case for this auto span config reconciliation job or change the calls to use SHOW JOBS instead of querying system.jobs directly, but I prefer disabling the job entirely considering these tests shouldn't need to bother with the nuances of this auto job.
It's worth improving the comment here to say as much (and capitalize "disable").
pkg/server/server_sql.go, line 1037 at r9 (raw file):
nextTick := timeutil.Now()
for {
	nextTickCh := time.After(nextTick.Sub(timeutil.Now()))
Use time.NewTicker instead. Also, should we attach a SetOnChange hook on the cluster setting to attempt job creation whenever it changes? If the default duration is long, it could come in handy when trying to quickly bounce it. For upgrading clusters it's possible for the reconciliation job to only get started 10m (or whatever the setting ends up being) after the upgrade has finalized. Is that what we want? I'm not sure what we could do -- maybe a similar cluster-setting-changed hook for the version setting? Or maybe a "migration" that also tries to idempotently call into this job creation hook?
pkg/server/server_sql.go, line 1043 at r9 (raw file):
// cluster version allows for it.
if s.execCfg.Settings.Version.IsActive(ctx, clusterversion.AutoSpanConfigReconciliationJob) {
	_ = s.execCfg.StartSpanConfigReconciliationJobHook(ctx)
Return + log the error.
pkg/server/server_sql.go, line 1051 at r9 (raw file):
}
nextTick = nextTick.Add(
	spanconfig.CheckAndStartReconciliationJobInterval.Get(&s.execCfg.Settings.SV),
Given that this cluster setting is only used in this package, should we move it here instead? Seems like a property of the SQL server, rather than the span config manager.
pkg/server/server_sql.go, line 1053 at r9 (raw file):
	spanconfig.CheckAndStartReconciliationJobInterval.Get(&s.execCfg.Settings.SV),
)
log.Infof(
Remove this log message.
pkg/spanconfig/spanconfigmanager/manager.go, line 64 at r9 (raw file):
started, err := m.createAndStartJobIfNoneExists(ctx)
if err != nil {
	log.Errorf(ctx, "error starting auto span config reconciliation job: %v", err)
Return the error to the caller now that this isn't firing off an async task.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @Azhng, @irfansharif, @nvanbenschoten, and @RichardJCai)
pkg/server/server_sql.go, line 1033 at r9 (raw file):
Quoted 5 lines of code…
// Start a background task that starts the auto span config reconciliation job.
// The background task also periodically (as dictated by a cluster setting)
// checks to ensure that the job exists. We don't expect this to happen, but
// if it does, we want to start it again.
if err := s.stopper.RunAsyncTask(ctx, "start-span-config-reconciliation-job",
This feels out of place here. Why burden the server package with this logic? Can you instead hide this behind a Start method on the Manager struct.
pkg/server/server_sql.go, line 1037 at r9 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Use time.NewTicker instead. Also, should we attach a SetOnChange hook on the cluster setting to attempt job creation when changed? If the default duration is long, it could come in handy when trying to quickly bounce it. For upgrading clusters it's possible for the reconciliation job to only going to started 10m (or whatever the setting ends up being) after the upgrade has finalized. Is that what we want? I'm not sure what we could do -- maybe a similar cluster setting changed hook for the version setting? Or maybe a "migration" that also tries to idempotently call into this job creation hook?
I prefer a timeutil.Timer over either of those. It's a bit bulkier but strictly more flexible than either.
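For context, a sketch of the usual timeutil.Timer idiom for such a loop (the interval setting, m.settings, and m.stopper fields are illustrative, not the PR's exact names):
timer := timeutil.NewTimer()
defer timer.Stop()
timer.Reset(0) // fire the first check immediately
for {
	select {
	case <-timer.C:
		timer.Read = true // part of the timeutil.Timer contract before the next Reset
		if _, err := m.createAndStartJobIfNoneExists(ctx); err != nil {
			log.Errorf(ctx, "failed to ensure reconciliation job exists: %v", err)
		}
		timer.Reset(checkInterval.Get(&m.settings.SV)) // re-arm from the cluster setting
	case <-m.stopper.ShouldQuiesce():
		return
	}
}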
pkg/server/server_sql.go, line 1051 at r9 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Given that this cluster setting is only used in this package, should we move it here instead? Seems like a property of the SQL server, rather than the span config manager.
I feel more or less the opposite. Can we stop referencing it in the server package?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @Azhng, @irfansharif, @nvanbenschoten, and @RichardJCai)
pkg/server/server_sql.go, line 1033 at r9 (raw file):
Previously, ajwerner wrote…
// Start a background task that starts the auto span config reconciliation job.
// The background task also periodically (as dictated by a cluster setting)
// checks to ensure that the job exists. We don't expect this to happen, but
// if it does, we want to start it again.
if err := s.stopper.RunAsyncTask(ctx, "start-span-config-reconciliation-job",

This feels out of place here. Why burden the server package with this logic? Can you instead hide this behind a Start method on the Manager struct.
Maybe I missed something, but from the discussion around synchronous vs. asynchronous API's I was under the impression that we wanted the caller of this method to be responsible for establishing the async task (as opposed to the async task being part of the API). Is that not what we want?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @Azhng, @irfansharif, @nvanbenschoten, and @RichardJCai)
pkg/server/server_sql.go, line 1033 at r9 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Maybe I missed something, but from the discussion around synchronous vs. asynchronous API's I was under the impression that we wanted the caller of this method to be responsible for establishing the async task (as opposed to the async task being part of the API). Is that not what we want?
What @nvanbenschoten said regarding testability was right. You did add a synchronous API. What I'm hanging on isn't how you structure the async part of this, whether it be some function you call or some struct that you start. What I'm hanging on is where that logic lives. The server layer here should be an integration and dependency injection layer. It can have business logic, sure, but that business logic should be about setting up the server given the setting, not as much the concerns of some subsystem itself.
@nvanbenschoten how far off on outlook are we in terms of what this integration should look like?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @Azhng, @irfansharif, @nvanbenschoten, and @RichardJCai)
pkg/server/server_sql.go, line 1033 at r9 (raw file):
What I'm hanging on isn't how you structure the async part of this, whether it be some function you call or some struct that you start. What I'm hanging on is where that logic lives.
Maybe I misunderstood the ask to hide this behind a Start method on Manager. I was assuming that the stopper.RunAsyncTask would be part of the Start method. Are you instead suggesting putting just the periodic cluster version check + idempotent job queueing in the Start method and still calling that here inside stopper.RunAsyncTask?
Reviewed 2 of 12 files at r8, 32 of 32 files at r9, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @Azhng, @irfansharif, @nvanbenschoten, and @RichardJCai)
pkg/jobs/registry_test.go, line 128 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
It's worth improving the comment here to say as much (and capitalize "disable").
Agreed. Spell that all out. Otherwise, it will be impossible for anyone to ever convince themselves it's safe to remove this knob in the future, even if parts of the system change.
pkg/server/server_sql.go, line 1033 at r9 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
What I'm hanging on isn't how you structure the async part of this, whether it be some function you call or some struct that you start. What I'm hanging on is where that logic lives.
Maybe I misunderstood the ask to hide this behind a Start method on Manager. I was assuming that the stopper.RunAsyncTask would be part of the Start method. Are you instead suggesting putting just the periodic cluster version check + idempotent job queueing in the Start method and still calling that here inside stopper.RunAsyncTask?
I think what Andrew is saying is that the async task and the business logic of this retry loop should live inside of the spanconfig package (or really anywhere else but server). We removed the asynchrony from the Manager.StartJobIfNoneExists call to make the inner transaction easier to test, but that doesn't mean the retry logic needs to be fully externalized. We just wanted to split out the core logic in StartJobIfNoneExists from the task that was orchestrating it. But that orchestration logic still belongs somewhere inside of the spanconfig layer.
The way to look at this is that you want to push async processing as far out as possible, but no further.
It would be very reasonable to add a (*spanconfigmanager.Manager).Start method that launches an async task and calls (*spanconfigmanager.Manager).startJobIfNoneExists in this loop, but then to test startJobIfNoneExists directly. That's kind of what you originally had, so maybe I should have just advised you to call createAndStartJobIfNoneExist directly from your unit tests.
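In other words, roughly this shape (a sketch; method and task names follow the discussion above rather than the final code): the exported Start only launches the task, and the synchronous pieces stay directly testable.
// Start kicks off the background orchestration. All of the actual logic lives
// in synchronous methods (run, createAndStartJobIfNoneExists) that tests can
// call directly without going through the stopper.
func (m *Manager) Start(ctx context.Context) error {
	return m.stopper.RunAsyncTask(ctx, "span-config-manager", func(ctx context.Context) {
		m.run(ctx) // periodic check-and-create loop
	})
}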
pkg/spanconfig/spanconfigmanager/manager_test.go, line 101 at r5 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Can't use
blocker
because I'm reassigning it tounblocker
and closing it to signal back to the first goroutine, but I removed this look and replaced it with a different channel. I don't think I need thesync.Once
either because of the way the test is setup.
This still seems overly complex to me. First off, why do we need the mutex? Also, the blocker/unblocker/isBlocked stuff is tough to read and reason about. There's too much state. I think I see why you need two channels, but not the third. I imagine this could be simplified with the following sync primitives:
var firstCheck sync.Once
blockFirstC := make(chan struct{})
unblockFirstC := make(chan struct{})
...
ManagerAfterCheckedReconciliationJobExistsInterceptor: func(exists bool) {
	firstCheck.Do(func() {
		close(blockFirstC)
		<-unblockFirstC
	})
}
...
go func() {
	defer wg.Done()
	<-blockFirstC
	...
	close(unblockFirstC)
}()
Also, nit, but consider using errgroup and then returning these t.Errorf as errors. Then you can require.NoError(t, wg.Wait()) at the end.
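Roughly what that suggestion looks like (a sketch; the manager call signature is assumed from the discussion above): goroutines return errors instead of calling t.Errorf, and the group surfaces the first failure.
var g errgroup.Group
g.Go(func() error {
	started, err := manager.StartJobIfNoneExists(ctx)
	if err != nil {
		return err
	}
	if !started {
		return errors.New("expected this attempt to have started the job")
	}
	return nil
})
// ... the other goroutines likewise return errors rather than calling t.Errorf ...
require.NoError(t, g.Wait())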
Force-pushed f5d1d74 to fd0af6e
RFAL!
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @arulajmani, @Azhng, @irfansharif, @nvanbenschoten, and @RichardJCai)
pkg/jobs/registry_test.go, line 128 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Agreed. Spell that all out. Otherwise, it will be impossible for anyone to ever convince themselves it's safe to remove this knob in the future, even if parts of the system change.
Done.
pkg/server/server_sql.go, line 1033 at r9 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think what Andrew is saying is that the async task and the business logic of this retry loop should live inside of the spanconfig package (or really anywhere else but server). We removed the asynchrony from the Manager.StartJobIfNoneExists call to make the inner transaction easier to test, but that doesn't mean the retry logic needs to be fully externalized. We just wanted to split out the core logic in StartJobIfNoneExists from the task that was orchestrating it. But that orchestration logic still belongs somewhere inside of the spanconfig layer. The way to look at this is that you want to push async processing as far out as possible, but no further.
It would be very reasonable to add a (*spanconfigmanager.Manager).Start method that launches an async task and calls (*spanconfigmanager.Manager).startJobIfNoneExists in this loop, but then to test startJobIfNoneExists directly. That's kind of what you originally had, so maybe I should have just advised you to call createAndStartJobIfNoneExist directly from your unit tests.
I confused myself a bit here, my bad. I think I have it now and this thing looks much nicer! Lemme know what y'all think
pkg/server/server_sql.go, line 1037 at r9 (raw file):
Previously, ajwerner wrote…
I prefer a timeutil.Timer over either of those. It's a bit bulkier but strictly more flexible than either.
Done.
pkg/server/server_sql.go, line 1043 at r9 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Return + log the error.
I don't think we want to return here, do we? We probably want to try again instead of turning the whole subsystem off, right?
pkg/server/server_sql.go, line 1051 at r9 (raw file):
Previously, ajwerner wrote…
I feel more or less the opposite. Can we stop referencing it in the server package?
Done.
pkg/server/server_sql.go, line 1053 at r9 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Remove this log message.
Done.
pkg/spanconfig/spanconfigmanager/manager.go, line 64 at r9 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Return the error to the caller now that this isn't firing off an async task.
Done.
Force-pushed fd0af6e to 5e72733
Reviewed 2 of 34 files at r10, 32 of 32 files at r11, 1 of 1 files at r12, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @arulajmani, @Azhng, @irfansharif, and @RichardJCai)
pkg/spanconfig/spanconfigmanager/manager.go, line 85 at r12 (raw file):
lastChecked := time.Time{}
timer := timeutil.NewTimer()
defer timer.Stop()
pkg/spanconfig/spanconfigmanager/manager_test.go, line 81 at r12 (raw file):
	close(isBlocked)
	<-unblocker
} else {
This else branch seems redundant. It's also pretty subtle that the count changes between when we first compare/increment and later on. Did you consider a structure along the lines of:
count++
curCount := count
if curCount == 1 {
	close(isBlocked)
	<-unblocker
}
...
if curCount <= 2 {
...
Just the stuff Nathan pointed out and nits. Also needs a rebase. It will be ready to go in today.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @arulajmani, @Azhng, @irfansharif, and @RichardJCai)
pkg/spanconfig/spanconfigmanager/manager.go, line 74 at r12 (raw file):
func (m *Manager) Start(ctx context.Context) error {
	return m.stopper.RunAsyncTask(ctx, "start-span-config-reconciliation-job", func(ctx context.Context) {
nit: buy yourself two layers of indentation by making a method on the Manager called run that takes a context, and just pass it here.
pkg/spanconfig/spanconfigmanager/manager.go, line 178 at r12 (raw file):
Quoted 8 lines of code…
stmt := fmt.Sprintf(`
SELECT EXISTS(
  SELECT job_id FROM [SHOW AUTOMATIC JOBS]
   WHERE job_type = '%s' AND status IN ('%s', '%s', '%s')
);
`, jobspb.TypeAutoSpanConfigReconciliation.String(), jobs.StatusRunning, jobs.StatusPaused, jobs.StatusPending)
For the statuses, how about using jobs.NonTerminalStatusTupleString?
This patch creates a new job type intended for span config reconciliation. This job is hidden from the output of SHOW JOBS as it is intended to be an AUTOMATIC job.

Release note: None
Force-pushed 5e72733 to 39d5b1f
Rebased. I was speaking to @irfansharif offline and he mentioned that instead of defining a spanconfig.Manager interface we should just use the spanconfigmanager.Manager struct in server_sql.go -- the new revision includes that change as well.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @ajwerner, @arulajmani, @Azhng, @irfansharif, @nvanbenschoten, and @RichardJCai)
I was speaking to @irfansharif offline and he mentioned that instead of defining a spanconfig.Manager interface we should just use the spanconfigmanager.Manager struct in server_sql.go -- the new revision includes that change as well.
I like that very much.
Reviewed 1 of 35 files at r13, 7 of 33 files at r14.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @arulajmani, @irfansharif, @nvanbenschoten, and @RichardJCai)
pkg/spanconfig/spanconfig.go, line 31 at r14 (raw file):
type Manager interface {
	Start(ctx context.Context) error
Now this can be removed from the interface.
pkg/spanconfig/spanconfigmanager/manager.go, line 188 at r14 (raw file):
Quoted 8 lines of code…
SELECT EXISTS(
  SELECT job_id FROM [SHOW AUTOMATIC JOBS]
   WHERE job_type = '%s' AND status IN %s
);
`, jobspb.TypeAutoSpanConfigReconciliation.String(), jobs.NonTerminalStatusTupleString)
total nit: you can do this statically in a var.
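i.e., something along these lines (a sketch of the nit; the variable name is made up):
// Built once at package init instead of on every existence check.
var reconciliationJobExistsStmt = fmt.Sprintf(`
SELECT EXISTS(
  SELECT job_id FROM [SHOW AUTOMATIC JOBS]
   WHERE job_type = '%s' AND status IN %s
);
`, jobspb.TypeAutoSpanConfigReconciliation.String(), jobs.NonTerminalStatusTupleString)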
Force-pushed 39d5b1f to c6731e6
This patch introduces the span config subsystem and its orchestrator, the spanconfigmanager.Manager. The span config manager owns the process of reconciling SQL zone configs to KV span configs. This includes managing the span config reconciliation job and providing it access to dependencies it needs to perform reconciliation.

The span config reconciliation job is intended to be a per-tenant, forever-running job that is non-cancellable. At any point there should be one (and only one) of these jobs running. The manager helps ensure these semantics. Even though we expect this job to be forever running (and therefore never transition to a failed state), the manager on every SQL pod will periodically ensure that the job is indeed running. It will restart a new job in case it finds that the job is not running. The interval at which this happens is dictated by the `spanconfig.reconciliation_job.start_interval` cluster setting. It defaults to 10 minutes and is private for now.

This patch only adds the skeleton to encapsulate and plumb dependencies to the job. The specific components themselves will come in subsequent patches.

References cockroachdb#67679

Release note: None
Force-pushed c6731e6 to 6b8eae4
Reviewable status: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @ajwerner, @arulajmani, @irfansharif, @nvanbenschoten, and @RichardJCai)
pkg/spanconfig/spanconfig.go, line 31 at r14 (raw file):
Previously, ajwerner wrote…
Now this can be removed from the interface.
Was meant to remove the interface entirely.
Thanks for the reviews on this one. bors r=ajwerner,nvanbenschoten, irfansharif
Build succeeded.
🎉 |
See individual commits for details.