2.25.0.1500-b1
jaki tagged this 28 Sep 07:17
Summary:
At a high level, INSERT ON CONFLICT works as follows:

- For each value:
  - For each index:
    - If the value being inserted conflicts with a value in the index, run the ON CONFLICT part (either DO NOTHING or DO UPDATE). Move on to the next value.
    - Else, continue.
  - Since no index conflicts, INSERT normally. (In upstream PG, if the INSERT fails due to concurrent changes, it retries, but YB does not have that logic yet.)

This performs poorly on YB because each value incurs index reads followed by a write (unless DO NOTHING is hit). The alternating reads and writes prevent buffering of write requests, so many back-and-forth RPCs are made.

Solve this by batching the index reads during an INSERT ON CONFLICT. This allows writes to buffer up. Add a GUC yb_insert_on_conflict_read_batch_size to control how many rows to buffer for each table. A value of 1 disables batching and is the default. Note that for partitioned tables, this is the batch size for each partition.

Largely borrow from upstream PG's foreign table INSERT batching implementation. The flow goes as follows:

- For each value:
  - If batch size is reached, trigger batch flush
  - Store slot into in-memory list resultRelInfo->ri_Slots and similar
- Trigger batch flush for remainder slots

Batch flush goes as follows:

- For each index:
  - Read RPC to get all values matching the slots in this batch. Store into in-memory map resultRelInfo->ri_YbConflictMaps[i].
- For each slot:
  - For each index:
    - If this slot matches something in the map, run the ON CONFLICT part. Move on to the next slot.
    - Else, continue.
  - Since no index conflicts, INSERT normally.

The map needs to be updated in the ON CONFLICT DO UPDATE and normal INSERT cases. This involves changes to ExecInsertIndexTuples and ExecDeleteIndexTuples, particularly to support the map updates for primary key indexes. Also, the map tracks rows that were just inserted so that a double-insert error can be thrown, similar to upstream PG. This is only done when the duplicate is detected within the same batch. Otherwise, the behavior matches non-batched YB behavior, where the statement silently succeeds.

This feature is currently disabled in the following cases:

- non-YB relations
- catalog relations
- row triggers
- RETURNING clause

Detailed flow:

- ExecModifyTable
  - for (;;)
    - ExecProcNode (get a slot from input)
    - ExecInsert
      - switch to this slot's child resultRelInfo (for partitioned tables)
      - calculate generated columns
      - check permissions
      - if there's an ON CONFLICT clause
        - YbAddSlotToBatch
          - if batch is full
            - YbFlushSlotsFromBatch
          - add slot to ri_Slots, ri_PlanSlots
  - ExecPendingInserts
    - for each es_insert_pending_result_relations/es_insert_pending_modifytables
      - YbFlushSlotsFromBatch

This function is called in three places above:

- YbFlushSlotsFromBatch
  - if we just entered flushing mode
    - YbBatchFetchConflictingRows
      - ExecCheckIndexConstraints
        - for each index
          - if the index is not applicable (e.g. invalid, not part of arbiterIndexes)
            - continue
          - yb_batch_fetch_conflicting_rows
            - build map resultRelInfo->ri_YbConflictMaps[i]
  - while there are still slots to flush
    - YbExecCheckIndexConstraints
      - for each index
        - lookup map resultRelInfo->ri_YbConflictMaps[i]
        - if no match
          - continue
        - if match with just-inserted row
          - error
        - if match with existing row
          - return that there's a conflict
    - if the above check says there's a conflict
      - if DO UPDATE
        - ExecOnConflictUpdate
          - ExecUpdate
            - YBExecUpdateAct
              - (ExecCrossPartitionUpdate is disallowed)
              - YBCExecuteUpdateReplace/YBCExecuteUpdate
            - ExecUpdateEpilogue
              - ExecDeleteIndexTuples
                - for each index
                  - if the index is not applicable
                    - continue
                  - yb_index_delete (except PK index)
                  - update map resultRelInfo->ri_YbConflictMaps[i]
              - ExecInsertIndexTuples
                - for each index
                  - if the index is not applicable
                    - continue
                  - update map resultRelInfo->ri_YbConflictMaps[i]
                  - index_insert (except PK index)
            - AR triggers
      - else (DO NOTHING)
        - (nothing)
      - continue
    - YBCHeapInsert
    - ExecInsertIndexTuples
      - for each index
        - if the index is not applicable
          - continue
        - update map resultRelInfo->ri_YbConflictMaps[i]
        - index_insert (except PK index)
    - AR triggers
  - exit flushing mode
  - destroy all maps resultRelInfo->ri_YbConflictMaps

There are some behavior differences with batching enabled. Within a batch, when two rows map to the same key, we follow the PG semantics of throwing an error. Across batches, we follow the YB semantics of silently applying both changes. Moreover, when a WITH statement modifies the same table both inside and outside of the WITH clause, the ON CONFLICT decision can vary depending on the batch size (see the regress tests).
Jira: DB-13064

Test Plan:
On Almalinux 8:

    #!/usr/bin/env bash
    set -euo pipefail
    ./yb_build.sh fastdebug --gcc11
    find java/yb-pgsql/src/test/java/org/yb/pgsql -name 'TestPgRegressInsertOnConflict*' \
    | grep -oE 'TestPgRegress\w+' \
    | while read -r testname; do
      ./yb_build.sh fastdebug --gcc11 --java-test "$testname" --sj
    done

Jenkins: rebase: pg15

Reviewers: kramanathan, amartsinchyk

Reviewed By: kramanathan, amartsinchyk

Subscribers: smishra, yql, svc_phabricator

Differential Revision: https://phorge.dev.yugabyte.com/D36872
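
Illustration (not part of the change): the batched flow described in the summary can be modeled with a small Python sketch. Everything here is hypothetical and simplified: `Table`, `fetch_conflicts`, `flush_batch`, and `insert_on_conflict` are invented names, the table has a single unique key, and a counter stands in for read RPCs. It only mirrors the shape of the algorithm (one read per batch, a conflict map consulted per slot, a same-batch double-insert error), not the actual executor code.

```python
from typing import Dict, List, Tuple

# Marker for keys inserted earlier in the same batch (assumption of this
# sketch; the summary only says the map "tracks rows that were just inserted").
JUST_INSERTED = object()


class Table:
    """Hypothetical in-memory stand-in for a table with one unique key."""

    def __init__(self) -> None:
        self.rows: Dict[int, str] = {}
        self.read_rpcs = 0  # counts simulated index-read round trips

    def fetch_conflicts(self, keys: List[int]) -> Dict[int, object]:
        """One simulated read RPC covering every key in the batch."""
        self.read_rpcs += 1
        return {k: self.rows[k] for k in keys if k in self.rows}


def flush_batch(table: Table, batch: List[Tuple[int, str]], do_update: bool) -> None:
    # Step 1: a single read builds the conflict map for the whole batch.
    conflict_map = table.fetch_conflicts([key for key, _ in batch])
    # Step 2: decide per slot using only the in-memory map.
    for key, value in batch:
        if key in conflict_map:
            if conflict_map[key] is JUST_INSERTED:
                # Same-batch duplicate: error, as in the PG semantics described.
                raise ValueError(f"key {key} inserted twice in one batch")
            if do_update:
                table.rows[key] = value    # ON CONFLICT DO UPDATE
                conflict_map[key] = value  # keep the map in sync
            continue                       # DO NOTHING falls through here
        table.rows[key] = value            # no conflict: normal INSERT
        conflict_map[key] = JUST_INSERTED


def insert_on_conflict(table: Table, values: List[Tuple[int, str]],
                       batch_size: int, do_update: bool = False) -> None:
    batch: List[Tuple[int, str]] = []
    for row in values:
        if len(batch) >= batch_size:  # batch size reached: flush first
            flush_batch(table, batch, do_update)
            batch = []
        batch.append(row)
    if batch:                         # flush the remainder
        flush_batch(table, batch, do_update)


if __name__ == "__main__":
    t = Table()
    t.rows[1] = "old"
    insert_on_conflict(t, [(1, "new"), (2, "b"), (3, "c")],
                       batch_size=3, do_update=True)
    print(t.rows, "read RPCs:", t.read_rpcs)
```

With `batch_size=1` the same input issues one simulated read per row, which is the alternating read/write pattern the change aims to avoid. The sketch also reproduces the behavior difference noted above: a duplicate key raises only when both rows land in the same batch.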