lightning: optimize local-backend duplicate detection, resolution and record policy #41629
Open
4 tasks done
Labels
component/lightning
This issue is related to Lightning of TiDB.
type/enhancement
The issue or PR belongs to an enhancement.
sleepymole added the component/lightning label on Feb 21, 2023
sleepymole changed the title from "lightning: optimize duplicate detection for non-incremental import" to "lightning: optimize duplicate detection, resolution and record policy" on Feb 27, 2023
sleepymole changed the title from "lightning: optimize duplicate detection, resolution and record policy" to "lightning: optimize local-backend duplicate detection, resolution and record policy" on Feb 27, 2023
This was referenced Apr 20, 2023
This was referenced Jun 20, 2023
Motivation
Currently, the workflow of duplicate detection is as follows:
- [...] lightning_task_info.conflict_error_v1.
- [...] lightning_task_info.conflict_error_v1 as well.
- [...] lightning_task_info.conflict_error_v1, and resolve all conflicts by deleting all related kv pairs.

The workflow of duplicate detection is not perfect, and there are some problems:
For the above problems, we want to improve the duplicate detection function in the following ways:
- [...] the on-duplicate option of tidb-backend, and deprecate the duplicate-resolution option of local-backend.
- [...] replace and ignore strategies for local-backend duplicate detection; the default strategy is replace.
- [...] max-error-records for local-backend duplicate detection.

Detailed design
To address the above problems, we are introducing a new design for duplicate detection. The new duplicate detection is decoupled from the import process. It runs before the import: it reads all data from the source and detects duplicate records. After detection completes, the import process starts and automatically skips the duplicate records during encoding.
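As a rough illustration of the flow described above, the following Go sketch first detects duplicate keys over the whole source and then runs an import pass that skips the flagged rows. It is not Lightning's implementation. `Record`, `DetectDuplicates`, and `Import` are hypothetical names, the in-memory sort merely stands in for the external/disk sorter the design mentions, and the semantics assumed for replace (the last row with a key wins) and ignore (the first row with a key wins), as well as the way the max-error-records cap is applied, are assumptions made for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// Record is a hypothetical stand-in for one source row and its encoded key.
type Record struct {
	Key   string // encoded primary or unique key
	RowID int    // position of the row in the source data
}

// Policy mirrors the replace and ignore strategies; the semantics are assumed.
type Policy string

const (
	Replace Policy = "replace" // assumption: the last row with a given key wins
	Ignore  Policy = "ignore"  // assumption: the first row with a given key wins
)

// DetectDuplicates runs before the import. It returns the set of row IDs the
// import should skip, plus at most maxErrorRecords conflicting rows to record.
func DetectDuplicates(records []Record, policy Policy, maxErrorRecords int) (map[int]bool, []Record) {
	// Sort a copy by key, then by source position (in-memory stand-in for an
	// external sorter).
	sorted := make([]Record, len(records))
	copy(sorted, records)
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].Key != sorted[j].Key {
			return sorted[i].Key < sorted[j].Key
		}
		return sorted[i].RowID < sorted[j].RowID
	})

	skip := make(map[int]bool)
	var recorded []Record
	for start := 0; start < len(sorted); {
		// Find the run of rows sharing the same key.
		end := start + 1
		for end < len(sorted) && sorted[end].Key == sorted[start].Key {
			end++
		}
		if end-start > 1 {
			keep := start // ignore: keep the first row
			if policy == Replace {
				keep = end - 1 // replace: keep the last row
			}
			for i := start; i < end; i++ {
				if i == keep {
					continue
				}
				skip[sorted[i].RowID] = true
				if len(recorded) < maxErrorRecords {
					recorded = append(recorded, sorted[i])
				}
			}
		}
		start = end
	}
	return skip, recorded
}

// Import walks the source in order and skips rows flagged by the detection pass.
func Import(records []Record, skip map[int]bool) []Record {
	var imported []Record
	for _, r := range records {
		if skip[r.RowID] {
			continue
		}
		imported = append(imported, r)
	}
	return imported
}

func main() {
	source := []Record{{"a", 0}, {"b", 1}, {"a", 2}, {"c", 3}, {"a", 4}}
	skip, recorded := DetectDuplicates(source, Replace, 1)
	fmt.Println("recorded conflicts:", recorded)
	fmt.Println("imported rows:", Import(source, skip))
}
```

Because detection finishes before any row is imported, the import pass only needs a membership check per row, which is the decoupling the paragraph above describes.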
ExternalSorter
Pseudo code for duplicate detection
Development tasks
DiskSorter
- [...] replace and ignore policies for local-backend
- [...] max-error-records option for local-backend