lightning: optimize local-backend duplicate detection, resolution and record policy #41629

sleepymole · 2023-02-21T08:26:13Z

Motivation

Currently, the workflow of duplicate detection is as follows:

While sending KV pairs to TiKV, detect conflicts and record them in a local Pebble DB.
After the data is imported, read the conflicts from the local Pebble DB and write them to the table lightning_task_info.conflict_error_v1.
Detect conflicts from TiKV and record them in the table lightning_task_info.conflict_error_v1 as well.
Read the conflict records from the table lightning_task_info.conflict_error_v1 and resolve all conflicts by deleting all related KV pairs.

This workflow has several problems:

With many conflicts, Lightning becomes very slow, because the conflict records are written to TiDB through the SQL interface.
duplicate-resolution only supports removing all duplicate records, but sometimes users want to keep one of them.
The duplicate detection option of local-backend is inconsistent with the option of tidb-backend, which is confusing to users.
Checksum and analyze are not performed once any conflict is detected, which reduces the reliability of the import.

For the above problems, we want to improve the duplicate detection function in the following ways:

Reuse the on-duplicate option of tidb-backend, and deprecate the duplicate-resolution option of local-backend.
Implement replace and ignore strategies for local-backend duplicate detection, with replace as the default strategy.
Do not store all conflict records in TiDB; support setting max-error-records for local-backend duplicate detection.
Run checksum and analyze whether or not there are conflicts.
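The difference between the two strategies can be illustrated with a small sketch. It assumes the same meaning the names have for tidb-backend's on-duplicate option: replace keeps the last record seen for a key, ignore keeps the first. The Row type and the resolve function are hypothetical and exist only for this illustration; they are not part of Lightning.

```go
package main

import "fmt"

// Row is a hypothetical record: Key is the unique-key value and ID is the
// row's position in the source data.
type Row struct {
	Key string
	ID  int
}

// resolve applies an assumed reading of the two policies to rows that arrive
// in source order: "replace" lets the last row of a key win, "ignore" lets the
// first row win. It returns the surviving rows and the number of dropped rows.
func resolve(rows []Row, policy string) (kept []Row, dropped int, err error) {
	if policy != "replace" && policy != "ignore" {
		return nil, 0, fmt.Errorf("unknown policy %q", policy)
	}
	winner := make(map[string]int, len(rows)) // key -> index of the winning row
	for i, r := range rows {
		if _, seen := winner[r.Key]; !seen || policy == "replace" {
			winner[r.Key] = i
		}
	}
	for i, r := range rows {
		if winner[r.Key] == i {
			kept = append(kept, r)
		} else {
			dropped++
		}
	}
	return kept, dropped, nil
}

func main() {
	rows := []Row{{"a", 1}, {"b", 2}, {"a", 3}}
	for _, policy := range []string{"replace", "ignore"} {
		kept, dropped, err := resolve(rows, policy)
		fmt.Println(policy, kept, dropped, err)
	}
}
```

Either way exactly one record per key survives, so the number of imported rows is the same under both policies; only which record is kept differs.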

Detailed design

To address the above problems, we are introducing a new design for duplicate detection. The new duplicate detection is decoupled from the import process. It runs before the import, reads all data from the source, and detects duplicate records. After the detection is completed, the import process starts and automatically skips the duplicate records during encoding.
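A minimal sketch of this two-phase flow, with everything held in memory. SourceRow, detect, and importRows are hypothetical names, and keeping the first row of each key is an assumption made only to keep the example short.

```go
package main

import (
	"fmt"
	"sort"
)

// SourceRow is a hypothetical row read from the data source.
type SourceRow struct {
	RowID int64
	Key   string // value of the unique key
}

// detect is the pre-import pass. It sorts a copy of the rows by (Key, RowID)
// and returns the IDs of the rows to skip. As an assumption, the row with the
// smallest RowID of each key is kept.
func detect(rows []SourceRow) map[int64]struct{} {
	sorted := append([]SourceRow(nil), rows...)
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].Key != sorted[j].Key {
			return sorted[i].Key < sorted[j].Key
		}
		return sorted[i].RowID < sorted[j].RowID
	})
	skip := make(map[int64]struct{})
	for i := 1; i < len(sorted); i++ {
		if sorted[i].Key == sorted[i-1].Key {
			skip[sorted[i].RowID] = struct{}{}
		}
	}
	return skip
}

// importRows is the import pass. It "encodes" every row that detect did not
// flag and silently skips the flagged ones.
func importRows(rows []SourceRow, skip map[int64]struct{}) []string {
	var encoded []string
	for _, r := range rows {
		if _, dup := skip[r.RowID]; dup {
			continue // duplicate: skipped during encoding
		}
		encoded = append(encoded, fmt.Sprintf("%s=%d", r.Key, r.RowID))
	}
	return encoded
}

func main() {
	rows := []SourceRow{{1, "a"}, {2, "b"}, {3, "a"}, {4, "a"}}
	skip := detect(rows)
	fmt.Println(importRows(rows, skip))
}
```

Because the import pass only consults the set of flagged row IDs, it never has to compare keys itself, which is what lets detection be decoupled from the import.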

ExternalSorter

// ExternalSorter is an interface for sorting key-value pairs in external storage.
// The key-value pairs are sorted by the key, duplicate keys are automatically removed.
type ExternalSorter interface {
	// NewWriter creates a new writer for writing key-value pairs before sorting.
	// Multiple writers can be opened and used concurrently.
	NewWriter(ctx context.Context) (Writer, error)
	// Sort sorts the key-value pairs written by the writer.
	// It should be called after all open writers are closed.
	Sort(ctx context.Context) error
	// IsSorted reports whether the key-value pairs are sorted, that is, whether iterators can be created.
	IsSorted() bool
	// NewIterator creates a new iterator for iterating over the key-value pairs after sorting.
	// Multiple iterators can be opened and used concurrently.
	NewIterator(ctx context.Context) (Iterator, error)
	// Close releases all resources held by the sorter. It will not clean up the external storage,
	// so the sorter can recover from a crash.
	Close() error
	// CloseAndCleanup closes the external sorter and cleans up all resources created by the sorter.
	CloseAndCleanup() error
}
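To show the intended call order (write, close the writers, sort, iterate), here is a toy in-memory sketch that mirrors the method set above. It is not the DiskSorter from the task list. The interface does not define Writer and Iterator, so memWriter and memIter and their methods (Put, Seek, Valid, Next, UnsafeKey) are assumptions, and the sorter methods return these concrete types instead of the interface types.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"sort"
)

type pair struct{ key, value []byte }

// memSorter is a toy in-memory stand-in for a disk-backed sorter. Unlike the interface contract, it is not safe for concurrent writers.
type memSorter struct {
	pairs  []pair
	sorted bool
}

// memWriter and memIter stand in for the undefined Writer and Iterator types.
type memWriter struct{ s *memSorter }

func (w *memWriter) Put(key, value []byte) error {
	k, v := append([]byte(nil), key...), append([]byte(nil), value...)
	w.s.pairs = append(w.s.pairs, pair{k, v})
	return nil
}

func (w *memWriter) Close() error { return nil }

func (s *memSorter) NewWriter(ctx context.Context) (*memWriter, error) {
	return &memWriter{s: s}, nil
}

// Sort orders pairs by key and keeps only the first pair written for each key.
func (s *memSorter) Sort(ctx context.Context) error {
	sort.SliceStable(s.pairs, func(i, j int) bool { return bytes.Compare(s.pairs[i].key, s.pairs[j].key) < 0 })
	unique := s.pairs[:0]
	for _, p := range s.pairs {
		if len(unique) == 0 || !bytes.Equal(unique[len(unique)-1].key, p.key) {
			unique = append(unique, p)
		}
	}
	s.pairs, s.sorted = unique, true
	return nil
}

func (s *memSorter) IsSorted() bool         { return s.sorted }
func (s *memSorter) Close() error           { return nil }
func (s *memSorter) CloseAndCleanup() error { s.pairs = nil; return nil }

type memIter struct {
	pairs []pair
	pos   int
}

// Seek moves to the first pair whose key is >= key; Seek(nil) moves to the start.
func (it *memIter) Seek(key []byte) bool {
	it.pos = sort.Search(len(it.pairs), func(i int) bool { return bytes.Compare(it.pairs[i].key, key) >= 0 })
	return it.Valid()
}
func (it *memIter) Valid() bool       { return it.pos < len(it.pairs) }
func (it *memIter) Next() bool        { it.pos++; return it.Valid() }
func (it *memIter) UnsafeKey() []byte { return it.pairs[it.pos].key }
func (it *memIter) Close() error      { return nil }

func (s *memSorter) NewIterator(ctx context.Context) (*memIter, error) {
	if !s.sorted {
		return nil, errors.New("sorter is not sorted yet")
	}
	return &memIter{pairs: s.pairs}, nil
}

// sortedKeys runs the full lifecycle: write, close, sort, iterate.
func sortedKeys(keys []string) ([]string, error) {
	ctx := context.Background()
	s := &memSorter{}
	w, _ := s.NewWriter(ctx) // the toy writer never fails
	for _, k := range keys {
		w.Put([]byte(k), nil)
	}
	w.Close()
	if err := s.Sort(ctx); err != nil {
		return nil, err
	}
	it, err := s.NewIterator(ctx)
	if err != nil {
		return nil, err
	}
	defer it.Close()
	var out []string
	for it.Seek(nil); it.Valid(); it.Next() {
		out = append(out, string(it.UnsafeKey()))
	}
	return out, nil
}

func main() {
	fmt.Println(sortedKeys([]string{"b", "a", "c", "a"}))
}
```

Since Sort removes duplicate keys, a detector that needs to see every occurrence of a key has to make each written key unique, for example by appending a per-record ID to it.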

Pseudocode for duplicate detection

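// detectDuplicates scans the sorted (key, keyID) pairs in workingSorter and
// counts the entries whose key equals the key of the previous entry.
func detectDuplicates(ctx context.Context, workingSorter, resultSorter ExternalSorter) (int, error) {
	resultWriter, err := resultSorter.NewWriter(ctx)
	if err != nil {
		return 0, err
	}
	if err := workingSorter.Sort(ctx); err != nil {
		return 0, err
	}
	it, err := workingSorter.NewIterator(ctx)
	if err != nil {
		return 0, err
	}
	defer it.Close()

	numDups := 0
	var lastKey, lastKeyID []byte
	for it.Seek(nil); it.Valid(); it.Next() {
		key, keyID, err := decodeInternalKey(it.UnsafeKey())
		if err != nil {
			return 0, err
		}
		if bytes.Equal(key, lastKey) {
			// Duplicate key found: keyID and lastKeyID identify the two
			// conflicting records, to be recorded through resultWriter.
			numDups++
		}
		lastKey = append(lastKey[:0], key...)
		lastKeyID = append(lastKeyID[:0], keyID...)
	}
	// Sort should be called only after all open writers are closed.
	if err := resultWriter.Close(); err != nil {
		return 0, err
	}
	if err := resultSorter.Sort(ctx); err != nil {
		return 0, err
	}
	return numDups, nil
}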
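The loop depends on decodeInternalKey, whose format is not given here. The sketch below is one hypothetical encoding, not the one Lightning uses: every 0x00 byte in the key is escaped as 0x00 0xFF, the key is terminated with 0x00 0x01, and the keyID is appended. This keeps all entries of the same key adjacent after a bytewise sort, which plain concatenation of key and keyID does not guarantee.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"sort"
)

// encodeInternalKey builds escaped(key) + 0x00 0x01 + keyID.
// The format is an assumption made for this illustration.
func encodeInternalKey(key, keyID []byte) []byte {
	buf := make([]byte, 0, len(key)+len(keyID)+2)
	for _, b := range key {
		buf = append(buf, b)
		if b == 0x00 {
			buf = append(buf, 0xFF) // escape 0x00 as 0x00 0xFF
		}
	}
	buf = append(buf, 0x00, 0x01) // terminator; sorts before any escaped 0x00
	return append(buf, keyID...)
}

// decodeInternalKey reverses encodeInternalKey.
func decodeInternalKey(data []byte) (key, keyID []byte, err error) {
	for i := 0; i < len(data); i++ {
		if data[i] != 0x00 {
			key = append(key, data[i])
			continue
		}
		if i+1 >= len(data) {
			break
		}
		switch data[i+1] {
		case 0xFF:
			key = append(key, 0x00)
			i++
		case 0x01:
			return key, data[i+2:], nil
		default:
			return nil, nil, errors.New("invalid escape in internal key")
		}
	}
	return nil, nil, errors.New("internal key is missing its terminator")
}

// countDuplicates encodes the (key, keyID) pairs, sorts them bytewise, and
// counts the entries whose decoded key equals the previous entry's key.
func countDuplicates(pairs [][2]string) (int, error) {
	encoded := make([][]byte, 0, len(pairs))
	for _, p := range pairs {
		encoded = append(encoded, encodeInternalKey([]byte(p[0]), []byte(p[1])))
	}
	sort.Slice(encoded, func(i, j int) bool {
		return bytes.Compare(encoded[i], encoded[j]) < 0
	})
	var lastKey []byte
	dups := 0
	for i, e := range encoded {
		key, _, err := decodeInternalKey(e)
		if err != nil {
			return 0, err
		}
		if i > 0 && bytes.Equal(key, lastKey) {
			dups++
		}
		lastKey = append(lastKey[:0], key...)
	}
	return dups, nil
}

func main() {
	// With plain concatenation, "ay" would sort between "a"+"x" and "a"+"z".
	pairs := [][2]string{{"a", "x"}, {"ay", ""}, {"a", "z"}, {"a\x00", "1"}}
	fmt.Println(countDuplicates(pairs))
}
```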

Development tasks

Add ExternalSorter interface and implement DiskSorter
Implement basic logic of duplicate detection
Implement replace and ignore policies for local-backend
Add max-error-records option for local-backend

sleepymole added the type/enhancement The issue or PR belongs to an enhancement. label Feb 21, 2023

sleepymole mentioned this issue Feb 21, 2023

lightning: optimize duplicate detection for non-incremental import #41327

Closed

12 tasks

sleepymole added the component/lightning This issue is related to Lightning of TiDB. label Feb 21, 2023

sleepymole changed the title ~~lightning: optimize duplicate detection for non-incremental import~~ lightning: optimize duplicate detection, resolution and record policy Feb 27, 2023

sleepymole changed the title ~~lightning: optimize duplicate detection, resolution and record policy~~ lightning: optimize local-backend duplicate detection, resolution and record policy Feb 27, 2023

sleepymole self-assigned this Feb 27, 2023

sleepymole closed this as completed Mar 2, 2023

sleepymole reopened this Mar 7, 2023

This was referenced Apr 20, 2023

util: introduce sortedmap #43246

Closed

util/extsort: introduce external sorter #43287

Merged

ti-chi-bot bot pushed a commit that referenced this issue Apr 27, 2023

util/extsort: introduce external sorter (#43287)

b21fb01

ref #41629

sleepymole mentioned this issue Apr 27, 2023

lightning: implement duplicate detector #43460

Merged

12 tasks

ti-chi-bot bot pushed a commit that referenced this issue May 15, 2023

lightning: implement duplicate detector (#43460)

433ac5c

ref #41629

sleepymole mentioned this issue May 18, 2023

lightning: implement preprocess duplicate detection #42647

Merged

12 tasks

sleepymole mentioned this issue May 25, 2023

util/extsort: parallelize DiskSorter.Sort #44185

Merged

12 tasks

lance6716 mentioned this issue May 28, 2023

[WIP]lightning: implement the Writer of RawKV deduplication sorter #44220

Closed

12 tasks

ti-chi-bot bot pushed a commit that referenced this issue May 29, 2023

lightning: implement preprocess duplicate detection (#42647)

cd46add

ref #41629

lance6716 mentioned this issue Jun 3, 2023

lightning: add error message for pre-deduplication #44317

Merged

12 tasks

ti-chi-bot bot pushed a commit that referenced this issue Jun 7, 2023

lightning: add error message for pre-deduplication (#44317)

6bab55c

ref #41629

ti-chi-bot bot pushed a commit that referenced this issue Jun 9, 2023

util/extsort: parallelize DiskSorter.Sort (#44185)

d05308f

ref #41629

lance6716 mentioned this issue Jun 15, 2023

lightning: mask pre-deduplication for local backend #44702

Merged

12 tasks

ti-chi-bot bot pushed a commit that referenced this issue Jun 15, 2023

lightning: mask pre-deduplication for local backend (#44702)

b4db489

ref #41629

This was referenced Jun 20, 2023

lightning: disable pre-deduplication test after we disable the code part #44812

Merged

lightning: disable pre-deduplication test after we disable the code part (#44812) #44813

Merged

ti-chi-bot bot pushed a commit that referenced this issue Jun 20, 2023

lightning: disable pre-deduplication test after we disable the code p…

e57e8ac

…art (#44812) (#44813) ref #41629

ti-chi-bot bot pushed a commit that referenced this issue Jun 21, 2023

lightning: disable pre-deduplication test after we disable the code p…

1bfdae8

…art (#44812) ref #41629

lance6716 mentioned this issue Jul 3, 2023

lightning: enable pre-deduplication and rename the recording table #45122

Merged

12 tasks

ti-chi-bot bot pushed a commit that referenced this issue Jul 5, 2023

lightning: enable pre-deduplication and rename the recording table (#…

ee7f4fc

…45122) ref #41629

lance6716 mentioned this issue Jul 6, 2023

lightning: add Conflict section to config and refactor #45197

Merged

12 tasks

ti-chi-bot bot pushed a commit that referenced this issue Jul 11, 2023

lightning: add Conflict section to config and refactor (#45197)

108cb92

ref #41629

lance6716 mentioned this issue Jul 13, 2023

lightning: fix CI and deprecate max-error.conflict #45349

Merged

12 tasks

ti-chi-bot bot pushed a commit that referenced this issue Jul 14, 2023

lightning: fix CI and deprecate max-error.conflict (#45349)

2cf78c7

ref #41629

lance6716 mentioned this issue Jul 18, 2023

lightning: tidb backend will check conflict threshold #45394

Merged

12 tasks

ti-chi-bot bot pushed a commit that referenced this issue Jul 19, 2023

lightning: tidb backend will check conflict threshold (#45394)

be9681f

ref #41629

lance6716 mentioned this issue Jul 20, 2023

lightning: "error" strategy outputs duplicate error in terminal, log and table #45471

Merged

12 tasks

ti-chi-bot bot pushed a commit that referenced this issue Jul 20, 2023

lightning: "error" strategy outputs duplicate error in terminal, log …

9662254

…and table (#45471) ref #41629
