Skip to content

2024.2.1.0-b160

Summary:
Commit 59b3636f505a7dc75ae91ab76348b20a89528e90 introduces read batching
for INSERT ON CONFLICT.  Commit 3f2ac8c27617a8775320f4edb0c829f059bd1eff
refactors the map part of the design.  This commit refactors the
ExecInsert part of the design and introduces RETURNING support.

- Decoupling of YB's read batching from PG's foreign data wrapper
  batching: before this change, the INSERT ON CONFLICT read batching
  flow sticks closely to the INSERT FDW batching.

  - Slots were added to per-relinfo buffers, and flushing happened
    whenever any of these buffers hit the batch size.  This caused some
    inconsistency in the order rows were inserted, and depending on the
    batch size, it could give different results.  Change these buffers
    to be per-plan so that an INSERT that gets routed to multiple
    partitioned tables still maintains insertion order.  One unobvious
    behavior change is that, now, in case of a mixed stream of input
    values, the write RPCs will increase as each batch touches a lot
    more tables whereas previously, each batch touched one table.
    Furthermore, since only one buffer exists per plan now rather than
    one buffer per partitioned table, the memory usage of storing these
    buffers reduces by that much.
  - The flushing also used to happen after we exceeded the batch size,
    not beforehand.  For example, if the batch size is 3, flushing
    happened when trying to insert the fourth value.  This causes some
    headache when trying to support RETURNING as that fourth value is
    already consumed from input while we are trying to flush the
    previous first to third values and trying to return those values.
    Restructure the flow to remove this odd behavior: now, the flush
    would happen after batching the third value, before reading the
    fourth value.  This involves a major code refactor, splitting
    ExecInsert into YbExecInsertPrologue and YbExecInsertAct.  The
    previous code duplicated part of ExecInsert with
    YbFlushSlotsFromBatch.  Now, that is deduplicated by
    YbExecInsertAct, which is called from both ExecInsert and
    YbFlushSlotsFromBatch.  While there is this deduplication, this
    restructuring introduces some duplication elsewhere, such as
    rechecking indexes whether they should be considered for ON CONFLICT
    between YbBatchFetchConflictingRows and ExecCheckIndexConstraints.
    Hopefully, these are minor enough that they can be deduplicated in
    the future, and the bigger goal is to reuse the pieces of this new
    structure for supporting read batching of the MERGE command in the
    future.
  - INSERT FDW batching calls ExecPendingInserts to flush _all_ buffers,
    including those for other tables in other plan nodes (see
    estate->es_insert_pending_result_relations), right before a BEFORE
    INSERT or BEFORE UPDATE trigger or when the current plan node's
    ExecModifyTable is exausted and we are flushing out the remainder.
    This is not necessarily correct if we want to stick close to
    non-batched behavior.  Neither is it always correct to flush only
    the current buffer and ignore other plan nodes' buffers.  This is
    because, in the lifetime of a batching cycle, we could get context
    switched out during the batching phase (which calls for flushing
    other plan nodes' buffers) or during the flushing phase (which calls
    for not flushing other plan nodes' buffers).  Two simple examples of
    these cases is a WITH CTE getting referenced (for the first time) in
    the values being inserted or it getting referenced (for the first
    time) in an ON CONFLICT DO UPDATE expression, respectively.  Since
    it is not simple to always determine which buffers ought to be
    flushed, default to flushing only the current buffer as that is
    simpler to implement (don't have to worry about infinite loops).
  - Stop reusing the fields that INSERT FDW batching uses for keeping
    track of batching, such as ri_NumSlots and ri_Slots, except for
    ri_BatchSize.  Store the rest of the information in a new struct
    YbInsertOnConflictBatchState.

- RETURNING support: besides the restructuring to make INSERT ON
  CONFLICT read batching with RETURNING easier to support, make several
  additional notable changes:

  - To support context switching out after having consumed a value from
    input, store the slot and plan slot in new fields in
    YbInsertOnConflictBatchState called pickup_slot and
    pickup_plan_slot.  Once flushing iscompleted, context is restored
    using these pickup slots.
  - Change the conflict map to be per-plan instead of global.  This is
    necessary in case we get context-switched out during flushing phase
    and interact with the conflict map of a different plan.  Keep the
    intent set global as we want to detect whether the same row is
    inserted twice across plans.  Truthfully, such detection should not
    be fully global as it should not mix into plans created by triggers,
    but that can be loosened in the future.  Along the same lines, there
    are other deficiencies with the map design such as how it is not
    updated when a plan does INSERT, UPDATE, or DELETE without involving
    INSERT ON CONFLICT as the map/set alteration only happens within
    INSERT ON CONFLICT code.  Maybe it is better to go back to design
    closer to before 3f2ac8c27617a8775320f4edb0c829f059bd1eff where the
    map/set updates happen in the index INSERT/DELETE/UPDATE paths to
    capture all the situations.  Push these finer details to be future
    work.

Add a bunch of test coverage for these finer cases involving WITH CTEs
and triggers.  Initial code by Karthik; finalizing design and adding
tests by me.
Jira: DB-13712

[BACKPORT 2024.2][#25296] YSQL: remove bad assertion in INSERT ON CONFLICT

In an INSERT ON CONFLICT read batching code path, there is an assertion
that can fail:

    Assert(state->num_slots >= key_count);

state->num_slots represents the number of input slots being processed
this batch.  This is typically equal to the batch size.  key_count
represents the number of keys read from the indexes.  Note index"es".
Assuming there's only one index, worst case, key_count equals num_slots.
But if there is more than one index, the worst case is that many times
num_slots.  So this assertion is faulty.  Remove it.
Jira: DB-14494

Merge:
- YbExecInsertPrologue
  - proute: not relevant in PG 11.2
  - BR trigger: ExecPendingInserts does not exist here in this PG 11.2
    based version, so disregard that.  Elsewhere, follow the same spirit
    of changing slot to *slot.
  - generated columns: does not exist in PG 11.2, so disregard
- ExecInsert
  - signature: preserve existing comment and signature, adding the one
    YB comment.
  - fdw: disregard since PG 11.2 doesn't support fdw batching.
  - tableOid: keep for PG 11.2.  Disregard generated columns comment.
- ExecPendingInserts: remove entirely
- ExecDeletePrologue, ExecDeleteAct, ExecDeleteEpilogue: rm
- ExecUpdatePrologue, ExecUpdatePrepareSlot, ExecUpdateAct,
  YBExecUpdateAct, ExecUpdateEpilogue,
  ExecCrossPartitionUpdateForeignKey: rm.  It merge conflicts because
  YBExecUpdateAct modifies YbExecCheckIndexConstraints function call.
  Apply that same change to the corresponding call in ExecUpdate.
- YbAddSlotToBatch: if (resultRelInfo->ri_NumSlots ==
  resultRelInfo->ri_BatchSize): this is relevant to fdw code, so
  discard.
- ExecInitModifyTable: add comment related to ri_BatchSize like in
  master.
- Introduce YbInsertOnConflictBatchState.tableOids because PG 15 -> PG
  11 loses table oid information from the slots.
- Put planSlot in ModifyTableContext.
- forming tuple: we want to do this in YbExecInsertAct, so since we
  materialize tuple there, no need to do so after BR trigger
- change foreach_delete_current to list_delete_cell (see upstream PG
  1cff1b95ab6ddae32faa3efe0d95a820dbfdc164).
- bunch of management of estate->es_result_relation_info: PG 15 -> PG 11
  means losing upstream PG a04daa97a4339c38e304cd6164d37da540d665a8, so
  resultRelInfo is passed around through estate.
- yb_insert_on_conflict regress test: update output for

      Partitioned tables cannot have BEFORE / FOR EACH ROW triggers.

- yb_insert_on_conflict_multiplan regress test: update output to

      ERROR:  trigger "parttrigtrig_odd2" for relation "parttrigtab_odd" already exists

  (Likely the ordering of which error hits first changed.)

Test Plan:
On Almalinux 8:

    #!/usr/bin/env bash
    set -euo pipefail
    ./yb_build.sh fastdebug --gcc11
    find java/yb-pgsql/src/test/java/org/yb/pgsql -name 'TestPgRegressInsertOnConflict*' \
    | grep -oE 'TestPgRegress\w+' \
    | while read -r testname; do
      ./yb_build.sh fastdebug --gcc11 --java-test "$testname" --sj
    done

Co-authored-by: Karthik Ramanathan <[email protected]>
Original commits:
- a120c2588cab65ebe497e0b8c64bd826b54a99e1 / D39229
- 429547f83f9f62b40a4ee6b291d7f28b3bd63511 / D40759

Reviewers: kramanathan

Reviewed By: kramanathan

Subscribers: amartsinchyk

Differential Revision: https://phorge.dev.yugabyte.com/D40839
Assets 2
Loading