2.25.0.1500-b1

Summary:
At a high level, INSERT ON CONFLICT works as follows:

- For each value:
  - For each index:
    - If the value being inserted conflicts with a value in the index,
      run the ON CONFLICT part (either DO NOTHING or DO UPDATE).  Move
      on to the next value.
    - Else, continue.
  - Since no index conflicts, INSERT normally.  (Upstream PG retries if
    the INSERT fails due to concurrent changes, but YB does not have
    that logic yet.)

This performs poorly on YB because each value requires index reads
followed by a write (unless DO NOTHING is hit).  The alternating reads
and writes prevent buffering of write requests, so a lot of
back-and-forth RPCs are made.
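
The cost of the alternation can be illustrated with a toy round-trip
counter.  This is only a sketch: count_rpcs and its buffering model
(writes are held back, and a read first flushes any held writes) are
assumptions made for illustration, not YB's actual RPC layer.

```python
def count_rpcs(ops):
    """Count round trips for a sequence of 'R' (read) and 'W' (write) ops.

    Assumed toy model: writes are buffered and sent together; a read
    must first flush any buffered writes, then makes its own round trip.
    """
    rpcs = 0
    pending_writes = 0
    for op in ops:
        if op == "W":
            pending_writes += 1
            continue
        if pending_writes:
            rpcs += 1          # flush buffered writes before reading
            pending_writes = 0
        rpcs += 1              # the read itself
    if pending_writes:
        rpcs += 1              # final flush
    return rpcs


rows, indexes = 4, 2
# Unbatched: per row, one read per index, then one write.
unbatched_ops = (["R"] * indexes + ["W"]) * rows
# Batched: one read per index for the whole batch, then all the writes.
batched_ops = ["R"] * indexes + ["W"] * rows

print(count_rpcs(unbatched_ops), count_rpcs(batched_ops))
```

Under this model, the unbatched sequence pays for a flush after every
row, while the batched sequence pays for the index reads once and a
single flush at the end.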

Solve this by batching the index reads during an INSERT ON CONFLICT,
which allows the writes to buffer up.  Add a GUC
yb_insert_on_conflict_read_batch_size to control how many rows to buffer
for each table.  A value of 1 disables batching and is the default.
Note that for partitioned tables, this is the batch size for each
partition.

This largely borrows from upstream PG's foreign table INSERT batching
implementation.

The flow goes as follows:

- For each value:
  - If batch size is reached, trigger batch flush
  - Store slot into in-memory list resultRelInfo->ri_Slots and similar
- Trigger batch flush for the remaining slots
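
A minimal sketch of this loop, assuming a plain Python list in place of
resultRelInfo->ri_Slots and a caller-supplied flush callback
(insert_with_batching and flush are hypothetical names, not functions
from the patch):

```python
def insert_with_batching(rows, batch_size, flush):
    """Buffer rows and hand them to flush() in groups of batch_size."""
    batch = []  # stands in for the in-memory slot list
    for row in rows:
        # Check for a full batch before storing the new slot.
        if len(batch) >= batch_size:
            flush(list(batch))
            batch = []
        batch.append(row)
    # Flush whatever is left over at the end of the input.
    if batch:
        flush(list(batch))


flushed = []
insert_with_batching(range(5), 2, flushed.append)
print(flushed)
```

With five input rows and a batch size of 2, the callback sees two full
batches followed by a final batch holding the one leftover row.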

Batch flush goes as follows:

- For each index:
  - Read RPC to get all values matching the slots in this batch.  Store
    into in-memory map resultRelInfo->ri_YbConflictMaps[i].
- For each slot:
  - For each index:
    - If this slot matches something in the map, run the ON CONFLICT
      part.  Move on to the next slot.
    - Else, continue.
  - Since no index conflicts, INSERT normally.
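
The batch flush can be sketched as follows.  The table is modeled as a
list of dicts and each unique index as a column name; flush_batch and
on_conflict are hypothetical names.  Map maintenance after each write
is deliberately left out of this sketch, so it is only meaningful for
batches whose rows do not collide with each other.

```python
def flush_batch(table, unique_cols, batch, on_conflict):
    """Flush one batch against a table with several unique indexes.

    table: list of dict rows; unique_cols: one column name per index.
    on_conflict(existing_row, slot) stands in for the ON CONFLICT part.
    """
    # One batched "read" per index: fetch existing rows whose key
    # matches any slot in this batch, keyed by the index value.
    conflict_maps = []
    for col in unique_cols:
        wanted = {slot[col] for slot in batch}
        conflict_maps.append(
            {row[col]: row for row in table if row[col] in wanted})

    for slot in batch:
        for col, cmap in zip(unique_cols, conflict_maps):
            existing = cmap.get(slot[col])
            if existing is not None:
                on_conflict(existing, slot)  # DO NOTHING or DO UPDATE
                break                        # move on to the next slot
        else:
            table.append(dict(slot))         # no index conflicts: INSERT


table = [{"id": 1, "email": "a"}]
batch = [{"id": 1, "email": "z"},   # conflicts on id
         {"id": 2, "email": "a"},   # conflicts on email
         {"id": 3, "email": "c"}]   # no conflict
conflicts = []
flush_batch(table, ["id", "email"], batch,
            lambda existing, slot: conflicts.append(slot["id"]))
print(conflicts, table)
```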

The map needs to be updated on ON CONFLICT DO UPDATE or normal INSERT
cases.  This involves changes to ExecInsertIndexTuples and
ExecDeleteIndexTuples, particularly to support the map updates for
primary key indexes.  Also, the map tracks rows that were just inserted
so that a double-insert error can be thrown, similar to upstream PG.
The error is raised only when the duplicate is detected within the same
batch.  Otherwise, the behavior matches non-batched YB behavior, which
silently succeeds.
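
A simplified sketch of the map bookkeeping, assuming a single unique
index and a DO UPDATE that leaves the key unchanged.  The state names,
DoubleInsertError, flush, and run are all hypothetical, and the sketch
does not model two slots hitting the same pre-existing row or rollback
of partial work after an error.

```python
class DoubleInsertError(Exception):
    """Two slots in one batch tried to insert the same new key."""


EXISTING, JUST_INSERTED = "existing", "just_inserted"  # assumed states


def flush(table, batch):
    """Apply (key, value) pairs as INSERT ... ON CONFLICT DO UPDATE.

    table: dict mapping key -> value, i.e. one unique index on key.
    """
    # Batched read: which keys in this batch already exist?
    conflict_map = {key: EXISTING for key, _ in batch if key in table}
    for key, value in batch:
        state = conflict_map.get(key)
        if state == JUST_INSERTED:
            raise DoubleInsertError(key)
        if state == EXISTING:
            table[key] = value             # DO UPDATE, key unchanged
            continue
        table[key] = value                 # normal INSERT
        conflict_map[key] = JUST_INSERTED  # keep the map in sync
    # The map is discarded when the flush ends.


def run(table, rows, batch_size):
    """Feed rows to flush() in chunks of batch_size."""
    for start in range(0, len(rows), batch_size):
        flush(table, rows[start:start + batch_size])
    return table


rows = [(1, "a"), (1, "b")]
print(run({}, rows, 1))  # separate batches: the second row updates
try:
    run({}, rows, 2)     # same batch: detected as a double insert
except DoubleInsertError as exc:
    print("double insert on key", exc)
```

Because the map only lives for one flush, the same two rows either
raise an error or quietly apply in sequence, depending on whether they
land in the same batch.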

This feature is currently disabled in the following cases:

- non-YB relations
- catalog relations
- row triggers
- RETURNING clause

Detailed flow:

- ExecModifyTable
  - for (;;)
    - ExecProcNode (get a slot from input)
    - ExecInsert
      - switch to this slot's child resultRelInfo (for partitioned
        tables)
      - calculate generated columns
      - check permissions
      - if there's an ON CONFLICT clause
        - YbAddSlotToBatch
          - if batch is full
            - YbFlushSlotsFromBatch
          - add slot to ri_Slots, ri_PlanSlots
  - ExecPendingInserts
    - for each
      es_insert_pending_result_relations/es_insert_pending_modifytables
      - YbFlushSlotsFromBatch

YbFlushSlotsFromBatch, called from the places above, goes as follows:

- YbFlushSlotsFromBatch
  - if we just entered flushing mode
    - YbBatchFetchConflictingRows
      - ExecCheckIndexConstraints
        - for each index
          - if the index is not applicable (e.g. invalid, not part of
            arbiterIndexes)
            - continue
          - yb_batch_fetch_conflicting_rows
            - build map resultRelInfo->ri_YbConflictMaps[i]
  - while there are still slots to flush
    - YbExecCheckIndexConstraints
      - for each index
        - lookup map resultRelInfo->ri_YbConflictMaps[i]
        - if no match
          - continue
        - if match with just-inserted row
          - error
        - if match with existing row
          - return that there's a conflict
    - if the above check says there's a conflict
      - if DO UPDATE
        - ExecOnConflictUpdate
          - ExecUpdate
            - YBExecUpdateAct
              - (ExecCrossPartitionUpdate is disallowed)
              - YBCExecuteUpdateReplace/YBCExecuteUpdate
            - ExecUpdateEpilogue
              - ExecDeleteIndexTuples
                - for each index
                  - if the index is not applicable
                    - continue
                  - yb_index_delete (except PK index)
                  - update map resultRelInfo->ri_YbConflictMaps[i]
              - ExecInsertIndexTuples
                - for each index
                  - if the index is not applicable
                    - continue
                  - update map resultRelInfo->ri_YbConflictMaps[i]
                  - index_insert (except PK index)
              - AR triggers
      - else (DO NOTHING)
        - (nothing)
      - continue
    - YBCHeapInsert
    - ExecInsertIndexTuples
      - for each index
        - if the index is not applicable
          - continue
        - update map resultRelInfo->ri_YbConflictMaps[i]
        - index_insert (except PK index)
    - AR triggers
  - exit flushing mode
  - destroy all maps resultRelInfo->ri_YbConflictMaps

There are some behavior differences with batching enabled.  Within a
batch, when two rows map to the same key, we follow the PG semantics of
throwing an error.  Across batches, we follow the YB semantics of
silently applying both changes.  Moreover, when a WITH statement
modifies the same table both inside and outside of the WITH clause, the
ON CONFLICT decision can vary depending on the batch size (see the
regress tests).

Jira: DB-13064

Test Plan:
On AlmaLinux 8:

    #!/usr/bin/env bash
    set -euo pipefail
    ./yb_build.sh fastdebug --gcc11
    find java/yb-pgsql/src/test/java/org/yb/pgsql -name 'TestPgRegressInsertOnConflict*' \
    | grep -oE 'TestPgRegress\w+' \
    | while read -r testname; do
      ./yb_build.sh fastdebug --gcc11 --java-test "$testname" --sj
    done

Jenkins: rebase: pg15

Reviewers: kramanathan, amartsinchyk

Reviewed By: kramanathan, amartsinchyk

Subscribers: smishra, yql, svc_phabricator

Differential Revision: https://phorge.dev.yugabyte.com/D36872