Skip to content
This repository has been archived by the owner on Nov 24, 2023. It is now read-only.

relay: don't purge relay logs when connected to last source #1400

Merged
merged 19 commits into from
Feb 4, 2021

Conversation

lance6716
Copy link
Collaborator

@lance6716 lance6716 commented Jan 26, 2021

What problem does this PR solve?

close #1398

What is changed and how it works?

Purge relay log directory when switch to different upstream.

If reconnected to same upstream, relay will look at position saved in relay meta and needed position which specified by user or saved in syncer checkpoint. If relay.meta is left behind, purge the relay folder

Check List

Tests

  • Integration test
  • Manual test (check in changed integration test, relay-dir has indeed contained "mysql-bin.000001")

Code changes

Side effects

Related changes

  • Need to cherry-pick to the release branch
  • Need to be included in the release note

@lance6716 lance6716 added the status/WIP This PR is still work in progress label Jan 26, 2021
@lance6716 lance6716 added needs-cherry-pick-release-2.0 This PR should be cherry-picked to release-2.0. Remove this label after cherry-picked to release-2.0 needs-update-release-note This PR should be added into release notes. Remove this label once the release notes are updated status/PTAL This PR is ready for review. Add this label back after committing new changes and removed status/WIP This PR is still work in progress labels Jan 27, 2021
@lance6716 lance6716 changed the title [WIP] relay: don't purge relay logs when connected to last source relay: don't purge relay logs when connected to last source Jan 27, 2021
@lance6716
Copy link
Collaborator Author

PTAL @3pointer @lichunzhu

@lance6716 lance6716 added status/DNM Do not merge, test is failing or blocked by another PR and removed status/PTAL This PR is ready for review. Add this label back after committing new changes labels Jan 28, 2021
@lance6716
Copy link
Collaborator Author

lance6716 commented Jan 28, 2021

Done. But if stop source when worker is offline, need to manually purge relay-dir


now we should check if a source is deleted, changed some information like 'relay-binlog-gtid' and restarted, DM should handle it properly.

maybe (*realRelayHolder).stopRelay or other logic should remove relay-dir.

@lance6716 lance6716 added status/PTAL This PR is ready for review. Add this label back after committing new changes and removed status/DNM Do not merge, test is failing or blocked by another PR labels Jan 28, 2021
@GMHDBJD
Copy link
Collaborator

GMHDBJD commented Jan 28, 2021

worker1<-source1   crash
worker2<-source1   crash
worker1<-source1   start pull from max(checkpoint, recoverd relay meta)'s binlog

seems this pr will always start pull the binlog files from relay.meta?

@lance6716
Copy link
Collaborator Author

worker1<-source1   crash
worker2<-source1   crash
worker1<-source1   start pull from max(checkpoint, recoverd relay meta)'s binlog

seems this pr will always start pull the binlog files from relay.meta?

yes, I want to ensure relay log is complete. if upstream has purged the binlog at the position of relay.meta, I am expecting DM auto switch to remote binlog.

Comment on lines 237 to 241
// purge relay dir when delete source
if err := h.relay.PurgeRelayDir(); err != nil {
return err
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a bit aggressive.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how to change 🤔

Copy link
Contributor

@lichunzhu lichunzhu Jan 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest moving relay dir to another name (relay_dir -> relay_dir_deprecated) instead of deleting it directly. We have purge-relay command and I think the purge action should only happen in this command.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

relay_dir_deprecated is out of relay_dir thus can't be deleted using purge-relay. And disk space is not freed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's a good idea. stop source is a normal operation. shouldn't purge data or move to another directory.

dm/worker/server.go Outdated Show resolved Hide resolved
relay/relay.go Show resolved Hide resolved
tests/relay_interrupt/run.sh Outdated Show resolved Hide resolved
@GMHDBJD
Copy link
Collaborator

GMHDBJD commented Jan 29, 2021

yes, I want to ensure relay log is complete. if upstream has purged the binlog at the position of relay.meta, I am expecting DM auto switch to remote binlog.

Haven't implemented now? I'm afraid that it can't meet user expectations and hard to locate the error since behavior has changed. Moreover, sync should wait relay for a long time if it start from the meta pos but not checkpoint.

@lance6716
Copy link
Collaborator Author

lance6716 commented Jan 29, 2021

yes, I want to ensure relay log is complete. if upstream has purged the binlog at the position of relay.meta, I am expecting DM auto switch to remote binlog.

Haven't implemented now? I'm afraid that it can't meet user expectations and hard to locate the error since behavior has changed. Moreover, sync should wait relay for a long time if it start from the meta pos but not checkpoint.

the tests/relay_interrupt/run.sh changed in this PR contains the binlog type switching test, so I think it's implemented.

yes, sync will wait a longer time. My prefer for a complete relay log is to handle this case: if an upstream/worker handles multiple tasks, later added tasks may access the hole of relay log thus lost events

@GMHDBJD
Copy link
Collaborator

GMHDBJD commented Jan 29, 2021

the tests/relay_interrupt/run.sh changed in this PR contains the binlog type switching test, so I think it's implemented.

That happens when sync meet errors, not relay. If relay meet an error, like binlog has been purged, sync will block now.

yes, sync will wait a longer time. My prefer for a complete relay log is to handle this case: if an upstream/worker handles multiple tasks, later added tasks may access the hole of relay log thus lost events

I prefer to check the checkpoint and relay meta. When worker bound to a source, compare the min checkpoint of the source and the relay meta location. If checkpoint is larger, purge relay dir and start from checkpoint(This is same as we specify relay-binlog-pos/gtid for a new source), if checkpoint is smaller, pull binlog from relay meta location.

If user add a new task for that source which older than relay, we just report an error. That's same as user specify relay-binlog-pos/gtid, but start from a older location in syncer, or sync start from a location which has been purged in relay.

@lance6716
Copy link
Collaborator Author

That happens when sync meet errors, not relay. If relay meet an error, like binlog has been purged, sync will block now.

I think we should also switch to remote binlog when relay meets error?

lichunzhu
lichunzhu previously approved these changes Feb 2, 2021
Copy link
Contributor

@lichunzhu lichunzhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lance6716
Copy link
Collaborator Author

/run-all-tests

@lance6716 lance6716 dismissed lichunzhu’s stale review February 2, 2021 07:18

need another LGTM

@lance6716 lance6716 added the status/LGT1 One reviewer already commented LGTM label Feb 2, 2021
@lance6716 lance6716 added this to the v2.0.2 milestone Feb 2, 2021
Copy link
Contributor

@3pointer 3pointer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lance6716 lance6716 merged commit efa5b30 into pingcap:master Feb 4, 2021
ti-srebot pushed a commit to ti-srebot/dm that referenced this pull request Feb 4, 2021
@ti-srebot
Copy link

cherry pick to release-2.0 in PR #1428

@ti-srebot ti-srebot added already-cherry-pick-2.0 The related PR is already cherry-picked to release-2.0. Add this label once the PR is cherry-picked and removed needs-cherry-pick-release-2.0 This PR should be cherry-picked to release-2.0. Remove this label after cherry-picked to release-2.0 labels Feb 4, 2021
lance6716 pushed a commit that referenced this pull request Feb 5, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
already-cherry-pick-2.0 The related PR is already cherry-picked to release-2.0. Add this label once the PR is cherry-picked needs-update-release-note This PR should be added into release notes. Remove this label once the release notes are updated status/LGT1 One reviewer already commented LGTM status/PTAL This PR is ready for review. Add this label back after committing new changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't purge relay folder if worker bounded to same source after restart
5 participants