ddl: decouple job scheduler from 'ddl' and make it run/exit as owner changes #53548

Merged
merged 23 commits into from
May 30, 2024

Conversation

@D3Hunter D3Hunter commented May 24, 2024

What problem does this PR solve?

Issue Number: ref #53246

Problem Summary:

What changed and how does it work?

  • Partially decouple the DDL job scheduler from 'ddl' (some fields are still coupled) and make it a separate, cancellable module. This should make further testing easier.
    • All context usage is switched from the ddl context to the scheduler/worker context, which is only valid while the node is owner.
    • The current ddl context is non-cancellable, which might cause issues on owner change: because it cannot be cancelled in time, routines might keep running until they hopefully hit some isOwner check, so two owners might co-exist for quite a long time.
  • After this PR, job-scheduler-related routines are only started when the current node is elected as owner. SyncJobSchemaVerLoop is started only on the owner node.
  • Fix the add-index disttask scheduler, which used 'ddlCtx.ctx'; it should use the context we pass in.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
    • Use tiup to start a cluster with 3 TiDB nodes.
    • Run a script that creates 1000 tables with 10 threads; while it runs, keep deleting keys with the prefix /tidb/ddl/fg/owner/ to force owner changes.
    • The DDLs should succeed, with no abnormal output in the logs.
mysql> select count(1) from INFORMATION_SCHEMA.tables where table_schema='ff_0';
+----------+
| count(1) |
+----------+
|     1000 |
+----------+
1 row in set (0.01 sec)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 24, 2024

tiprow bot commented May 24, 2024

Hi @D3Hunter. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Comment on lines 95 to 96
// TODO getTableByTxn is using DDL ctx which is never cancelled except when shutdown.
// we should move this heavy operation out.
Contributor Author

@tangenta ptal

@@ -89,7 +88,7 @@ func (sch *BackfillingSchedulerExt) OnNextSubtasksBatch(
return nil, err
}
job := &backfillMeta.Job
tblInfo, err := getTblInfo(sch.d, job)
Contributor Author

we should use the framework context

@@ -382,8 +386,6 @@ type ddlCtx struct {
*waitSchemaSyncedController
*schemaVersionManager

runningJobs *runningJobs
Contributor Author

moved to job scheduler

Comment on lines +549 to +552
// SeqNum is the total order in all DDLs, it's used to identify the order of
// moving the job into DDL history, not the order of the job execution.
// fast create table doesn't honor this field, there might duplicate seq_num in this case.
// TODO: deprecated it, as it forces 'moving jobs into DDL history' part to be serial.

@D3Hunter D3Hunter requested review from lance6716 and tangenta May 24, 2024 12:10

codecov bot commented May 24, 2024

Codecov Report

Attention: Patch coverage is 90.27237% with 25 lines in your changes are missing coverage. Please review.

Project coverage is 74.5399%. Comparing base (29bf008) to head (441d476).

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #53548        +/-   ##
================================================
+ Coverage   72.4973%   74.5399%   +2.0425%     
================================================
  Files          1506       1506                
  Lines        430821     430910        +89     
================================================
+ Hits         312334     321200      +8866     
+ Misses        99116      89797      -9319     
- Partials      19371      19913       +542     
Flag Coverage Δ
integration 49.2816% <70.8171%> (?)
unit 71.4318% <86.3813%> (-0.0496%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 53.9957% <ø> (ø)
parser ∅ <ø> (∅)
br 50.4487% <ø> (+8.6415%) ⬆️

@D3Hunter

/retest

tiprow bot commented May 27, 2024

@D3Hunter: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

@lance6716 lance6716 left a comment

will review soon

pkg/owner/manager.go (outdated, resolved)
@lance6716 lance6716 left a comment

7 / 19 files viewed

pkg/owner/manager_test.go (resolved)
tangenta previously approved these changes May 27, 2024
pkg/ddl/job_table.go (outdated, resolved)
@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label May 27, 2024
@D3Hunter

/retest

tiprow bot commented May 28, 2024

@D3Hunter: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

@D3Hunter

the failure of TestSkipSchemaChecker is due to #53599

lance6716 previously approved these changes May 28, 2024
@ti-chi-bot ti-chi-bot bot added the lgtm label May 28, 2024

ti-chi-bot bot commented May 30, 2024

@D3Hunter: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-lightning-integration-test 95022b0 link true /test pull-lightning-integration-test

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ti-chi-bot ti-chi-bot bot added the approved label May 30, 2024
@D3Hunter D3Hunter requested a review from lance6716 May 30, 2024 05:24
@D3Hunter D3Hunter dismissed stale reviews from lance6716 and tangenta May 30, 2024 05:24

code changed

@ti-chi-bot ti-chi-bot bot removed the approved label May 30, 2024

ti-chi-bot bot commented May 30, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tangenta, tiancaiamao

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the approved label May 30, 2024
@lance6716 lance6716 left a comment

20 / 29 files viewed

reviewing

@@ -289,11 +289,6 @@ func (w *worker) runReorgJob(reorgInfo *reorgInfo, tblInfo *model.TableInfo,
if err != nil {
return errors.Trace(err)
}
case <-w.ctx.Done():
Contributor

I see that we rely on doneCh to detect w.ctx.Done(). How about changing the signature of f to take a context parameter, and passing w.ctx as the argument in runReorgJob? Then we can be more confident about removing this case <-w.ctx.Done():

Contributor Author

all current f implementations are also called through the worker, so it should be fine, except that we still use a background context inside; will change that later

logutil.DDLLogger().Info("run reorg job quit")
d.removeReorgCtx(job.ID)
// We return dbterror.ErrWaitReorgTimeout here too, so that outer loop will break.
return dbterror.ErrWaitReorgTimeout
Contributor

the error returned to caller is also changed, not sure about its effect.

Contributor Author

the previous ctx was only cancelled on shutdown, so this error shouldn't matter much

@lance6716 lance6716 left a comment

rest lgtm. You can unhold the PR yourself after addressing the comments.

@D3Hunter

/unhold

@ti-chi-bot ti-chi-bot bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 30, 2024
@ti-chi-bot ti-chi-bot bot merged commit 04c66ee into pingcap:master May 30, 2024
20 of 21 checks passed
@D3Hunter D3Hunter deleted the start-routine-on-owner branch May 30, 2024 06:38
@D3Hunter D3Hunter mentioned this pull request Jun 6, 2024
18 tasks
@ti-chi-bot ti-chi-bot added the needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. label Jul 16, 2024
@ti-chi-bot

In response to a cherrypick label: new pull request created to branch release-7.5: #54662.

ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Jul 16, 2024
@ti-chi-bot ti-chi-bot added the needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. label Jul 30, 2024
@ti-chi-bot

In response to a cherrypick label: new pull request created to branch release-8.1: #55067.

ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Jul 30, 2024
@D3Hunter D3Hunter removed needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. labels Jul 30, 2024
Labels
approved lgtm release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
6 participants