[DocDB] Running ANALYZE on a large table in PgRegressPartitions.java test fails with "Operation failed. Try again.: Transaction aborted" #10989

Closed
deeps1991 opened this issue Jan 4, 2022 · 0 comments
Labels: area/docdb YugabyteDB core features

@deeps1991
Contributor

Description

The PgRegressPartitions test has been failing for a long time in the yb_pg_partition_aggregate test with the following error:

Operation failed. Try again.: Transaction aborted: 4b7eb767-5d71-494e-8ba6-1fc2a930a24c

This always occurs for two tables, pagg_tab_ml and pagg_tab_para, specifically when ANALYZE is run on them. These are the only tables with a comparatively large number of rows for a unit test (30,000 each).
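For reference, the failing statements come from yb_pg_partition_aggregate (see the diff excerpts in the commits below). The following is a minimal reproduction sketch; the single-level partition layout and column names are simplifications assumed for illustration, not the multi-level scheme the actual test uses:

```sql
-- Simplified sketch: the regress test uses a multi-level partitioning scheme and its own
-- column definitions; only the row count and the INSERT/ANALYZE statements match the test.
CREATE TABLE pagg_tab_ml (a int, b int, c text) PARTITION BY RANGE (a);
CREATE TABLE pagg_tab_ml_p1 PARTITION OF pagg_tab_ml FOR VALUES FROM (0) TO (10);
CREATE TABLE pagg_tab_ml_p2 PARTITION OF pagg_tab_ml FOR VALUES FROM (10) TO (20);
CREATE TABLE pagg_tab_ml_p3 PARTITION OF pagg_tab_ml FOR VALUES FROM (20) TO (30);

INSERT INTO pagg_tab_ml SELECT i % 30, i % 10, to_char(i % 4, 'FM0000') FROM generate_series(0, 29999) i;

-- Intermittently fails on a 3-node mini-cluster with:
--   ERROR:  Operation failed. Try again.: Transaction aborted: <uuid>
ANALYZE pagg_tab_ml;
```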

Analysis so far:
The transaction was aborted because the transaction coordinator did not receive a heartbeat for this transaction for longer than 5 seconds (heartbeats are normally sent roughly every 0.5 seconds, as the retry timestamps below show):

ts2|pid10692|:16130 I1102 17:57:52.288962 10718 transaction_coordinator.cc:403] T 46b21bb306144b82b906ab8aa0ba8718 P 512e973f7bc743c2965e663c455fd54b ID 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: Handle: transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING
....
ts2|pid10692|:16130 I1102 17:57:57.353188 12200 transaction_coordinator.cc:403] T 46b21bb306144b82b906ab8aa0ba8718 P 512e973f7bc743c2965e663c455fd54b ID 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: Handle: transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING

However, the postgres process was in fact trying to send the heartbeat repeatedly over this period; every attempt failed, either timing out or with an error stating that the TS it was reaching out to was not the leader:

ts1|pid10661|:26561 W1102 17:57:46.249198 11929 transaction.cc:1065] 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: Send heartbeat failed: Timed out (yb/rpc/outbound_call.cc:501): UpdateTransaction RPC (request call id 3112) to 127.166.218.43:26561 timed out after 0.499s, state: kRunning
ts1|pid10661|:26561 W1102 17:57:46.771456 11929 transaction.cc:1065] 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: Send heartbeat failed: Timed out (yb/rpc/outbound_call.cc:501): UpdateTransaction RPC (request call id 3116) to 127.166.218.43:26561 timed out after 0.435s, state: kRunning
ts1|pid10661|:26561 W1102 17:57:47.290393 10905 transaction.cc:1065] 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: Send heartbeat failed: Timed out (yb/rpc/outbound_call.cc:501): UpdateTransaction RPC (request call id 3120) to 127.166.218.43:26561 timed out after 0.433s, state: kRunning
ts1|pid10661|:26561 W1102 17:57:47.800714 10903 transaction.cc:1065] 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: Send heartbeat failed: Timed out (yb/rpc/rpc.cc:211): UpdateTransaction: tablet_id: "46b21bb306144b82b906ab8aa0ba8718" state { transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING } propagated_hybrid_time: 6700547552421593088, retrier: { task_id: -1 state: kRunning deadline: 955.149s } passed its deadline 955.149s (passed: 0.510s): Illegal state (yb/consensus/consensus.cc:113): Not the leader (tablet server error 15), state: kRunning
ts1|pid10661|:26561 W1102 17:57:48.336452 10902 transaction.cc:1065] 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: Send heartbeat failed: Timed out (yb/rpc/rpc.cc:211): UpdateTransaction: tablet_id: "46b21bb306144b82b906ab8aa0ba8718" state { transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING } propagated_hybrid_time: 6700547554511958016, retrier: { task_id: -1 state: kRunning deadline: 955.659s } passed its deadline 955.659s (passed: 0.536s): Illegal state (yb/consensus/consensus.cc:113): Not the leader (tablet server error 15), state: kRunning
ts1|pid10661|:26561 W1102 17:57:52.782088 10905 transaction.cc:1065] 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: Send heartbeat failed: Timed out (yb/rpc/outbound_call.cc:501): UpdateTransaction RPC (request call id 3217) to 127.154.211.68:16130 timed out after 0.500s, state: kRunning
....
ts1|pid10661|:26561 W1102 17:57:57.897202 11928 transaction.cc:1065] 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: Send heartbeat failed: Operation expired (yb/tablet/transaction_coordinator.cc:1230): Transaction 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140 expired or aborted by a conflict: 40001 (pgsql error 40001) 

This seems to occur because of ping-ponging leadership changes for tablet 46b21bb306144b82b906ab8aa0ba8718 (the transaction status tablet). Tracing back its leadership history:

1. Initially, TS2 (512e973f7bc743c2965e663c455fd54b) was the leader for this tablet:
ts2|pid10692|:16130 I1102 17:56:10.864859 10698 consensus_meta.cc:298] T 46b21bb306144b82b906ab8aa0ba8718 P 512e973f7bc743c2965e663c455fd54b: Updating active role from FOLLOWER to LEADER. Consensus state: current_term: 1 leader_uuid: "512e973f7bc743c2965e663c455fd54b" config { opid_index: -1 peers { permanent_uuid: "512e973f7bc743c2965e663c455fd54b" member_type: VOTER last_known_private_addr { host: "127.154.211.68" port: 16130 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } peers { permanent_uuid: "cb5289750a074c3a81e1c786da5b855f" member_type: VOTER last_known_private_addr { host: "127.166.218.43" port: 26561 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } peers { permanent_uuid: "8e3bb6cc064c48f99b473b8015d52a84" member_type: VOTER last_known_private_addr { host: "127.64.6.253" port: 15311 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } }, has_pending_config = 0
2. Then YB-Master (cluster_balance) switched the leadership to TS1 (cb5289750a074c3a81e1c786da5b855f):
m1|pid10469|:27818 I1102 17:56:12.612680 10567 cluster_balance.cc:1332] Moving leader of 46b21bb306144b82b906ab8aa0ba8718 from TS 512e973f7bc743c2965e663c455fd54b to cb5289750a074c3a81e1c786da5b855f
m1|pid10469|:27818 I1102 17:56:12.613368 10880 async_rpc_tasks.cc:1101] Async Leader Stepdown (task=0x705ed38, state=kRunning): Stepping down leader 512e973f7bc743c2965e663c455fd54b for tablet 46b21bb306144b82b906ab8aa0ba8718
3. This was fine for about 1.5 minutes. However, out of the blue, these TS1 logs appeared:
ts1|pid10661|:26561 I1102 17:57:46.011170 11129 transaction_coordinator.cc:514] T 46b21bb306144b82b906ab8aa0ba8718 P cb5289750a074c3a81e1c786da5b855f ID 21b67d03-abf5-4eb9-80c8-a3cee7863c69: Applied to 48fbc6faeb4b4abd8e31a87fe2aaaeb1, left not applied: 3
ts1|pid10661|:26561 W1102 17:57:46.249034 11929 tablet_rpc.cc:450] Timed out (yb/rpc/outbound_call.cc:501): Failed UpdateTransaction: tablet_id: "46b21bb306144b82b906ab8aa0ba8718" state { transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING } propagated_hybrid_time: 6700547546066542592, retrier: { task_id: -1 state: kIdle deadline: 953.596s } to tablet 46b21bb306144b82b906ab8aa0ba8718 on tablet server { uuid: cb5289750a074c3a81e1c786da5b855f private: [host: "127.166.218.43" port: 26561] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" after 1 attempt(s): UpdateTransaction RPC (request call id 3112) to 127.166.218.43:26561 timed out after 0.499s
4. I couldn't see anything that explains why this failure started to happen. However, it did kick off a leader election, and TS2 became the leader:
ts2|pid10692|:16130 I1102 17:57:48.318205 12065 consensus_meta.cc:298] T 46b21bb306144b82b906ab8aa0ba8718 P 512e973f7bc743c2965e663c455fd54b: Updating active role from FOLLOWER to LEADER. Consensus state: current_term: 3 leader_uuid: "512e973f7bc743c2965e663c455fd54b" config { opid_index: -1 peers { permanent_uuid: "512e973f7bc743c2965e663c455fd54b" member_type: VOTER last_known_private_addr { host: "127.154.211.68" port: 16130 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } peers { permanent_uuid: "cb5289750a074c3a81e1c786da5b855f" member_type: VOTER last_known_private_addr { host: "127.166.218.43" port: 26561 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } peers { permanent_uuid: "8e3bb6cc064c48f99b473b8015d52a84" member_type: VOTER last_known_private_addr { host: "127.64.6.253" port: 15311 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } }, has_pending_config = 0
5. TS1 completed its step-down successfully:
ts2|pid10692|:16130 I1102 17:57:48.961429 11942 transaction_coordinator.cc:403] T 46b21bb306144b82b906ab8aa0ba8718 P 512e973f7bc743c2965e663c455fd54b ID 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: Handle: transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING
ts2|pid10692|:16130 I1102 17:57:48.961488 11942 transaction_coordinator.cc:610] T 46b21bb306144b82b906ab8aa0ba8718 P 512e973f7bc743c2965e663c455fd54b ID 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: DoHandle, replicating = <NULL>
ts2|pid10692|:16130 I1102 17:57:48.961511 11942 transaction_coordinator.cc:651] T 46b21bb306144b82b906ab8aa0ba8718 P 512e973f7bc743c2965e663c455fd54b ID 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: SubmitUpdateStatus, replicating = 0x0000000009195dd0 -> { type: kUpdateTransaction consensus_round: <NULL> }
ts2|pid10692|:16130 I1102 17:57:48.993568 11965 transaction_coordinator.cc:228] T 46b21bb306144b82b906ab8aa0ba8718 P 512e973f7bc743c2965e663c455fd54b ID 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: ProcessReplicated: { leader_term: 3 state: transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING op_id: 3.182 hybrid_time: { physical: 1635875868961854 } }, replicating: 0x0000000009195dd0 -> { type: kUpdateTransaction consensus_round: 0x000000000603aa40 -> id { term: 3 index: 182 } hybrid_time: 6700547559267753984 op_type: UPDATE_TRANSACTION_OP committed_op_id { term: 3 index: 181 } monotonic_counter: 0 transaction_state { transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING } }
ts2|pid10692|:16130 I1102 17:57:48.993721 11965 transaction_coordinator.cc:1165] T 46b21bb306144b82b906ab8aa0ba8718 P 512e973f7bc743c2965e663c455fd54b: Processed: { leader_term: 3 state: transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING op_id: 3.182 hybrid_time: { physical: 1635875868961854 } }
ts3|pid10727|:15311 I1102 17:57:48.995158 12092 transaction_coordinator.cc:228] T 46b21bb306144b82b906ab8aa0ba8718 P 8e3bb6cc064c48f99b473b8015d52a84 ID 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: ProcessReplicated: { leader_term: -1 state: transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING op_id: 3.182 hybrid_time: { physical: 1635875868961854 } }, replicating: <NULL>
6. YB-Master again tried to switch the leader back to TS1, but TS2 did not step down:
m1|pid10469|:27818 I1102 17:57:49.387811 10567 cluster_balance.cc:1332] Moving leader of 46b21bb306144b82b906ab8aa0ba8718 from TS 512e973f7bc743c2965e663c455fd54b to cb5289750a074c3a81e1c786da5b855f
ts2|pid10692|:16130 I1102 17:57:49.579746 12039 tablet_service.cc:2651] Received Leader stepdown RPC: tablet_id: "46b21bb306144b82b906ab8aa0ba8718" dest_uuid: "512e973f7bc743c2965e663c455fd54b" new_leader_uuid: "cb5289750a074c3a81e1c786da5b855f"
ts2|pid10692|:16130 I1102 17:57:49.579952 12039 tablet_service.cc:2664] Leader stepdown request tablet_id: "46b21bb306144b82b906ab8aa0ba8718" dest_uuid: "512e973f7bc743c2965e663c455fd54b" new_leader_uuid: "cb5289750a074c3a81e1c786da5b855f" success. Resp code=LEADER_NOT_READY_TO_STEP_DOWN
m1|pid10469|:27818 I1102 17:57:49.593930 12174 async_rpc_tasks.cc:1121] Async Leader Stepdown (task=0x6662978, state=kComplete): Leader step down done attempt=1, leader_uuid=512e973f7bc743c2965e663c455fd54b, change_uuid=512e973f7bc743c2965e663c455fd54b, error=code: LEADER_NOT_READY_TO_STEP_DOWN status { code: ILLEGAL_STATE message: "Suggested peer is not caught up yet" source_file: "../../src/yb/consensus/raft_consensus.cc" source_line: 822 errors: "\000" }, failed=1, should_remove=0 for tablet 46b21bb306144b82b906ab8aa0ba8718.

7. These logs were seen on other nodes:

ts1|pid10661|:26561 W1102 17:57:50.375286 12124 transaction_participant.cc:526] T 0b0127d83c2944f98e6d1cec20e12413 P cb5289750a074c3a81e1c786da5b855f: Apply of unknown transaction: { leader_term: -1 transaction_id: 21b67d03-abf5-4eb9-80c8-a3cee7863c69 op_id: 1.63 commit_ht: { physical: 1635875864488609 } log_ht: { physical: 1635875869822960 } sealed: 0 status_tablet: 46b21bb306144b82b906ab8aa0ba8718 apply_state: <NULL> }
ts1|pid10661|:26561 I1102 17:57:50.693392 12105 consensus_peers.cc:230] T 46b21bb306144b82b906ab8aa0ba8718 P cb5289750a074c3a81e1c786da5b855f -> Peer 512e973f7bc743c2965e663c455fd54b ([host: "127.154.211.68" port: 16130], []): Could not obtain request from queue for peer: Not found (yb/consensus/consensus_queue.cc:462): Peer not tracked or queue not in leader mode.

ts3|pid10727|:15311 I1102 17:57:50.713989 11119 transaction_coordinator.cc:297] T 46b21bb306144b82b906ab8aa0ba8718 P 8e3bb6cc064c48f99b473b8015d52a84 ID 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: ClearRequests: Illegal state (yb/tablet/transaction_coordinator.cc:253): Leader changed, replicating: <NULL>
ts3|pid10727|:15311 I1102 17:57:50.714025 11119 transaction_coordinator.cc:1165] T 46b21bb306144b82b906ab8aa0ba8718 P 8e3bb6cc064c48f99b473b8015d52a84: Processed: { leader_term: -1 state: transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING op_id: 3.185 hybrid_time: { physical: 1635875870704395 } }
ts1|pid10661|:26561 I1102 17:57:50.708238 12201 consensus_peers.cc:230] T 46b21bb306144b82b906ab8aa0ba8718 P cb5289750a074c3a81e1c786da5b855f -> Peer 8e3bb6cc064c48f99b473b8015d52a84 ([host: "127.64.6.253" port: 15311], []): Could not obtain request from queue for peer: Not found (yb/consensus/consensus_queue.cc:462): Peer not tracked or queue not in leader mode.
ts1|pid10661|:26561 I1102 17:57:50.717396 11468 consensus_peers.cc:514] T 46b21bb306144b82b906ab8aa0ba8718 P cb5289750a074c3a81e1c786da5b855f -> Peer 8e3bb6cc064c48f99b473b8015d52a84 ([host: "127.64.6.253" port: 15311], []): Closing peer

ts1|pid10661|:26561 I1102 17:57:50.717460 12206 consensus_peers.cc:230] T 46b21bb306144b82b906ab8aa0ba8718 P cb5289750a074c3a81e1c786da5b855f -> Peer 512e973f7bc743c2965e663c455fd54b ([host: "127.154.211.68" port: 16130], []): Could not obtain request from queue for peer: Not found (yb/consensus/consensus_queue.cc:462): Peer not tracked or queue not in leader mode.
ts1|pid10661|:26561 I1102 17:57:50.717574 11468 consensus_peers.cc:514] T 46b21bb306144b82b906ab8aa0ba8718 P cb5289750a074c3a81e1c786da5b855f -> Peer 512e973f7bc743c2965e663c455fd54b ([host: "127.154.211.68" port: 16130], []): Closing peer
8. Some operations were aborted:
ts1|pid10661|:26561 I1102 17:57:50.717725 11468 raft_consensus.cc:3293] T 46b21bb306144b82b906ab8aa0ba8718 P cb5289750a074c3a81e1c786da5b855f [term 2 FOLLOWER]: Advancing to term 3
ts1|pid10661|:26561 I1102 17:57:50.739455 11468 raft_consensus.cc:1692] T 46b21bb306144b82b906ab8aa0ba8718 P cb5289750a074c3a81e1c786da5b855f [term 3 FOLLOWER]: Refusing update from remote peer 512e973f7bc743c2965e663c455fd54b: Log matching property violated. Preceding OpId in replica: 2.181. Preceding OpId from leader: 3.180. (term mismatch)
ts1|pid10661|:26561 I1102 17:57:50.739492 11468 replica_state.cc:566] T 46b21bb306144b82b906ab8aa0ba8718 P cb5289750a074c3a81e1c786da5b855f [term 3 FOLLOWER]: Aborting all operations after (but not including): 179. Current State: Replica: cb5289750a074c3a81e1c786da5b855f, State: 1, Role: FOLLOWER, Watermarks: { Received: 2.181 Committed: 2.179} Leader: 0.0
ts1|pid10661|:26561 I1102 17:57:50.739542 11468 replica_state.cc:599] T 46b21bb306144b82b906ab8aa0ba8718 P cb5289750a074c3a81e1c786da5b855f [term 3 FOLLOWER]: Aborting uncommitted operation due to leader change: 2.181, committed: 2.179
ts1|pid10661|:26561 I1102 17:57:50.739658 11468 update_txn_operation.cc:79] T 46b21bb306144b82b906ab8aa0ba8718 0x0000000005efebd0 -> { type: kUpdateTransaction consensus_round: 0x0000000007e3f540 -> id { term: 2 index: 181 } hybrid_time: 6700547554789031936 op_type: UPDATE_TRANSACTION_OP committed_op_id { term: 2 index: 179 } monotonic_counter: 0 transaction_state { transaction_id: "!\266}\003\253\365N\271\200\310\243\316\347\206<i" status: APPLIED_IN_ALL_INVOLVED_TABLETS } }: Aborted: Aborted (yb/consensus/replica_state.cc:595): Operation aborted by new leader
ts1|pid10661|:26561 I1102 17:57:50.739810 11468 transaction_coordinator.cc:281] T 46b21bb306144b82b906ab8aa0ba8718 P cb5289750a074c3a81e1c786da5b855f ID 21b67d03-abf5-4eb9-80c8-a3cee7863c69: ProcessAborted: transaction_id: "!\266}\003\253\365N\271\200\310\243\316\347\206<i" status: APPLIED_IN_ALL_INVOLVED_TABLETS, replicating: 0x0000000005efebd0 -> { type: kUpdateTransaction consensus_round: 0x0000000007e3f540 -> id { term: 2 index: 181 } hybrid_time: 6700547554789031936 op_type: UPDATE_TRANSACTION_OP committed_op_id { term: 2 index: 179 } monotonic_counter: 0 transaction_state { transaction_id: "!\266}\003\253\365N\271\200\310\243\316\347\206<i" status: APPLIED_IN_ALL_INVOLVED_TABLETS } }
ts1|pid10661|:26561 I1102 17:57:50.739876 11468 transaction_coordinator.cc:297] T 46b21bb306144b82b906ab8aa0ba8718 P cb5289750a074c3a81e1c786da5b855f ID 21b67d03-abf5-4eb9-80c8-a3cee7863c69: ClearRequests: Aborted (yb/tablet/transaction_coordinator.cc:292): Replication failed, replicating: <NULL>
ts1|pid10661|:26561 I1102 17:57:50.739912 11468 transaction_coordinator.cc:1195] T 46b21bb306144b82b906ab8aa0ba8718 P cb5289750a074c3a81e1c786da5b855f: Aborted, state: transaction_id: "!\266}\003\253\365N\271\200\310\243\316\347\206<i" status: APPLIED_IN_ALL_INVOLVED_TABLETS, op id: 2.181
ts1|pid10661|:26561 I1102 17:57:50.740075 11468 replica_state.cc:599] T 46b21bb306144b82b906ab8aa0ba8718 P cb5289750a074c3a81e1c786da5b855f [term 3 FOLLOWER]: Aborting uncommitted operation due to leader change: 2.180, committed: 2.179
ts1|pid10661|:26561 I1102 17:57:50.740202 11468 transaction_coordinator.cc:281] T 46b21bb306144b82b906ab8aa0ba8718 P cb5289750a074c3a81e1c786da5b855f ID 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: ProcessAborted: transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING, replicating: 0x0000000007536ab0 -> { type: kUpdateTransaction consensus_round: 0x0000000005a4b680 -> id { term: 2 index: 180 } hybrid_time: 6700547554788802560 op_type: UPDATE_TRANSACTION_OP committed_op_id { term: 2 index: 179 } monotonic_counter: 0 transaction_state { transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING } }
ts1|pid10661|:26561 I1102 17:57:50.740243 11468 transaction_coordinator.cc:297] T 46b21bb306144b82b906ab8aa0ba8718 P cb5289750a074c3a81e1c786da5b855f ID 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: ClearRequests: Aborted (yb/tablet/transaction_coordinator.cc:292): Replication failed, replicating: <NULL>
ts1|pid10661|:26561 I1102 17:57:50.740748 11468 transaction_coordinator.cc:1195] T 46b21bb306144b82b906ab8aa0ba8718 P cb5289750a074c3a81e1c786da5b855f: Aborted, state: transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING, op id: 2.180
ts1|pid10661|:26561 I1102 17:57:50.741080 11468 raft_consensus.cc:1922] T 46b21bb306144b82b906ab8aa0ba8718 P cb5289750a074c3a81e1c786da5b855f [term 3 FOLLOWER]: Returning from UpdateConsensus because of error: error { code: PRECEDING_ENTRY_DIDNT_MATCH status { code: ILLEGAL_STATE message: "Log matching property violated. Preceding OpId in replica: 2.181. Preceding OpId from leader: 3.180. (term mismatch)" source_file: "../../src/yb/consensus/raft_consensus.cc" source_line: 1690 errors: "\000" } }
9. For a couple of seconds after this, there were no errors:
ts2|pid10692|:16130 I1102 17:57:51.772511 12057 transaction_coordinator.cc:228] T 46b21bb306144b82b906ab8aa0ba8718 P 512e973f7bc743c2965e663c455fd54b ID 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: ProcessReplicated: { leader_term: 3 state: transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING op_id: 3.187 hybrid_time: { physical: 1635875871767324 } }, replicating: 0x0000000006386240 -> { type: kUpdateTransaction consensus_round: 0x00000000075c4080 -> id { term: 3 index: 187 } hybrid_time: 6700547570758959104 op_type: UPDATE_TRANSACTION_OP committed_op_id { term: 3 index: 186 } monotonic_counter: 0 transaction_state { transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING } }
ts2|pid10692|:16130 I1102 17:57:51.772644 12057 transaction_coordinator.cc:1165] T 46b21bb306144b82b906ab8aa0ba8718 P 512e973f7bc743c2965e663c455fd54b: Processed: { leader_term: 3 state: transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING op_id: 3.187 hybrid_time: { physical: 1635875871767324 } }

Transaction coordinator in TS1:
ts1|pid10661|:26561 I1102 17:57:51.778187 12119 transaction_coordinator.cc:228] T 46b21bb306144b82b906ab8aa0ba8718 P cb5289750a074c3a81e1c786da5b855f ID 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: ProcessReplicated: { leader_term: -1 state: transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING op_id: 3.187 hybrid_time: { physical: 1635875871767324 } }, replicating: <NULL>
ts1|pid10661|:26561 I1102 17:57:51.778281 12119 transaction_coordinator.cc:297] T 46b21bb306144b82b906ab8aa0ba8718 P cb5289750a074c3a81e1c786da5b855f ID 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: ClearRequests: Illegal state (yb/tablet/transaction_coordinator.cc:253): Leader changed, replicating: <NULL>
ts1|pid10661|:26561 I1102 17:57:51.778322 12119 transaction_coordinator.cc:1165] T 46b21bb306144b82b906ab8aa0ba8718 P cb5289750a074c3a81e1c786da5b855f: Processed: { leader_term: -1 state: transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING op_id: 3.187 hybrid_time: { physical: 1635875871767324 } }

TS3 coordinator:
ts3|pid10727|:15311 I1102 17:57:51.787726 12092 transaction_coordinator.cc:228] T 46b21bb306144b82b906ab8aa0ba8718 P 8e3bb6cc064c48f99b473b8015d52a84 ID 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: ProcessReplicated: { leader_term: -1 state: transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING op_id: 3.187 hybrid_time: { physical: 1635875871767324 } }, replicating: <NULL>
ts3|pid10727|:15311 I1102 17:57:51.787832 12092 transaction_coordinator.cc:297] T 46b21bb306144b82b906ab8aa0ba8718 P 8e3bb6cc064c48f99b473b8015d52a84 ID 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: ClearRequests: Illegal state (yb/tablet/transaction_coordinator.cc:253): Leader changed, replicating: <NULL>

ts2|pid10692|:16130 I1102 17:57:52.288962 10718 transaction_coordinator.cc:403] T 46b21bb306144b82b906ab8aa0ba8718 P 512e973f7bc743c2965e663c455fd54b ID 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: Handle: transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING
ts2|pid10692|:16130 I1102 17:57:52.289029 10718 transaction_coordinator.cc:610] T 46b21bb306144b82b906ab8aa0ba8718 P 512e973f7bc743c2965e663c455fd54b ID 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: DoHandle, replicating = <NULL>
ts2|pid10692|:16130 I1102 17:57:52.289068 10718 transaction_coordinator.cc:651] T 46b21bb306144b82b906ab8aa0ba8718 P 512e973f7bc743c2965e663c455fd54b ID 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: SubmitUpdateStatus, replicating = 0x0000000006608480 -> { type: kUpdateTransaction consensus_round: <NULL> }
10. However, the UpdateTransaction RPCs again started to fail:
ts1|pid10661|:26561 W1102 17:57:52.781981 10905 tablet_rpc.cc:450] Timed out (yb/rpc/outbound_call.cc:501): Failed UpdateTransaction: tablet_id: "46b21bb306144b82b906ab8aa0ba8718" state { transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING } propagated_hybrid_time: 6700547572860755968, retrier: { task_id: -1 state: kIdle deadline: 960.138s } to tablet 46b21bb306144b82b906ab8aa0ba8718 on tablet server { uuid: 512e973f7bc743c2965e663c455fd54b private: [host: "127.154.211.68" port: 16130] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" after 1 attempt(s): UpdateTransaction RPC (request call id 3217) to 127.154.211.68:16130 timed out after 0.500s
11. And attempts to find the new leader failed:
ts1|pid10661|:26561 I1102 17:57:52.804654 10902 tablet_rpc.cc:143] Unable to pick leader for 46b21bb306144b82b906ab8aa0ba8718, replicas: [0x00007f0bac03f6e0 -> { uuid: 8e3bb6cc064c48f99b473b8015d52a84 private: [host: "127.64.6.253" port: 15311] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", 0x00007f0bac004a90 -> { uuid: cb5289750a074c3a81e1c786da5b855f private: [host: "127.166.218.43" port: 26561] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1"], followers: [{0x00007f0bac004a90 -> { uuid: cb5289750a074c3a81e1c786da5b855f private: [host: "127.166.218.43" port: 26561] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", { status: Illegal state (yb/consensus/consensus.cc:113): Not the leader (tablet server error 15) time: 0.004s }}, {0x00007f0bac03f6e0 -> { uuid: 8e3bb6cc064c48f99b473b8015d52a84 private: [host: "127.64.6.253" port: 15311] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1", { status: Illegal state (yb/consensus/consensus.cc:113): Not the leader (tablet server error 15) time: 0.021s }}]

ts1|pid10661|:26561 I1102 17:57:52.852095 11929 meta_cache.cc:815] 0x00007f0bbc037338 -> LookupByIdRpc(tablet: 46b21bb306144b82b906ab8aa0ba8718, num_attempts: 2): Failed, got resp error NOT_THE_LEADER
ts1|pid10661|:26561 W1102 17:57:53.297384 10902 meta_cache.cc:787] 0x00007f0bbc037338 -> LookupByIdRpc(tablet: 46b21bb306144b82b906ab8aa0ba8718, num_attempts: 2): Failed to determine new Master: Timed out (yb/rpc/rpc.cc:211): GetLeaderMasterRpc(addrs: [127.104.46.224:27818, 127.131.60.231:20062, 127.201.246.68:16434], num_attempts: 15) passed its deadline 960.640s (passed: 0.440s): Not found (yb/master/master_rpc.cc:286): no leader found: GetLeaderMasterRpc(addrs: [127.104.46.224:27818, 127.131.60.231:20062, 127.201.246.68:16434], num_attempts: 1)
12. Failure logs started to appear about the leader not having acquired a lease, along with other errors:
ts2|pid10692|:16130 W1102 17:57:54.737035 10710 tablet_rpc.cc:450] Timed out (yb/rpc/rpc.cc:211): Failed UpdateTransaction: tablet_id: "c8f7f23121d44d27aed473a7e521493f" state { transaction_id: "!\266}\003\253\365N\271\200\310\243\316\347\206<i" status: APPLYING tablets: "46b21bb306144b82b906ab8aa0ba8718" commit_hybrid_time: 6700547540945342464 sealed: false aborted { } } propagated_hybrid_time: 6700547562309926912, retrier: { task_id: -1 state: kRunning deadline: 962.063s } to tablet c8f7f23121d44d27aed473a7e521493f (no tablet server available) after 21 attempt(s): UpdateTransaction: tablet_id: "c8f7f23121d44d27aed473a7e521493f" state { transaction_id: "!\266}\003\253\365N\271\200\310\243\316\347\206<i" status: APPLYING tablets: "46b21bb306144b82b906ab8aa0ba8718" commit_hybrid_time: 6700547540945342464 sealed: false aborted { } } propagated_hybrid_time: 6700547562309926912, retrier: { task_id: -1 state: kRunning deadline: 962.063s } passed its deadline 962.063s (passed: 5.030s): Leader does not have a valid lease. (yb/consensus/consensus.cc:125): This leader has not yet acquired a lease. (tablet server error 15)
ts2|pid10692|:16130 W1102 17:57:54.737154 10710 transaction_coordinator.cc:1331] T 46b21bb306144b82b906ab8aa0ba8718 P 512e973f7bc743c2965e663c455fd54b: Failed to send apply for transaction: 21b67d03-abf5-4eb9-80c8-a3cee7863c69: Timed out (yb/rpc/rpc.cc:211): UpdateTransaction: tablet_id: "c8f7f23121d44d27aed473a7e521493f" state { transaction_id: "!\266}\003\253\365N\271\200\310\243\316\347\206<i" status: APPLYING tablets: "46b21bb306144b82b906ab8aa0ba8718" commit_hybrid_time: 6700547540945342464 sealed: false aborted { } } propagated_hybrid_time: 6700547562309926912, retrier: { task_id: -1 state: kRunning deadline: 962.063s } passed its deadline 962.063s (passed: 5.030s): Leader does not have a valid lease. (yb/consensus/consensus.cc:125): This leader has not yet acquired a lease. (tablet server error 15)
ts2|pid10692|:16130 W1102 17:57:54.765192 10708 transaction_coordinator.cc:1331] T 46b21bb306144b82b906ab8aa0ba8718 P 512e973f7bc743c2965e663c455fd54b: Failed to send apply for transaction: 21b67d03-abf5-4eb9-80c8-a3cee7863c69: Timed out (yb/rpc/rpc.cc:211): UpdateTransaction: tablet_id: "ac1351a0369e46078479731a9a6cf7a9" state { transaction_id: "!\266}\003\253\365N\271\200\310\243\316\347\206<i" status: APPLYING tablets: "46b21bb306144b82b906ab8aa0ba8718" commit_hybrid_time: 6700547540945342464 sealed: false aborted { } } propagated_hybrid_time: 6700547562309926912, retrier: { task_id: -1 state: kRunning deadline: 962.063s } passed its deadline 962.063s (passed: 5.058s): Leader not ready to serve requests. (yb/consensus/consensus.cc:117): Leader not yet replicated NoOp to be ready to serve requests (tablet server error 24)
ts2|pid10692|:16130 W1102 17:57:54.765770 10710 transaction_coordinator.cc:1331] T 46b21bb306144b82b906ab8aa0ba8718 P 512e973f7bc743c2965e663c455fd54b: Failed to send apply for transaction: 21b67d03-abf5-4eb9-80c8-a3cee7863c69: Timed out (yb/rpc/rpc.cc:211): UpdateTransaction: tablet_id: "2f4ca9c1ccf2427d90d2a6188a97958a" state { transaction_id: "!\266}\003\253\365N\271\200\310\243\316\347\206<i" status: APPLYING tablets: "46b21bb306144b82b906ab8aa0ba8718" commit_hybrid_time: 6700547540945342464 sealed: false aborted { } } propagated_hybrid_time: 6700547562309926912, retrier: { task_id: -1 state: kRunning deadline: 962.063s } passed its deadline 962.063s (passed: 5.058s): Leader not ready to serve requests. (yb/consensus/consensus.cc:117): Leader not yet replicated NoOp to be ready to serve requests (tablet server error 24)
13. Meanwhile, the transaction had not seen any progress for 5 seconds and was finally aborted:
ts1|pid10661|:26561 W1102 17:57:54.932397 10902 tablet_rpc.cc:450] Timed out (yb/rpc/rpc.cc:211): Failed UpdateTransaction: tablet_id: "46b21bb306144b82b906ab8aa0ba8718" state { transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING } propagated_hybrid_time: 6700547581618536448, retrier: { task_id: -1 state: kRunning deadline: 962.276s } to tablet 46b21bb306144b82b906ab8aa0ba8718 (no tablet server available) after 4 attempt(s): UpdateTransaction: tablet_id: "46b21bb306144b82b906ab8aa0ba8718" state { transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING } propagated_hybrid_time: 6700547581618536448, retrier: { task_id: -1 state: kRunning deadline: 962.276s } passed its deadline 962.276s (passed: 0.514s): Illegal state (yb/consensus/consensus.cc:113): Not the leader (tablet server error 15)
ts1|pid10661|:26561 W1102 17:57:54.932484 10902 transaction.cc:1065] 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140: Send heartbeat failed: Timed out (yb/rpc/rpc.cc:211): UpdateTransaction: tablet_id: "46b21bb306144b82b906ab8aa0ba8718" state { transaction_id: "hr\327\303\260\313I\372\263\315\267|\300\305\261@" status: PENDING } propagated_hybrid_time: 6700547581618536448, retrier: { task_id: -1 state: kRunning deadline: 962.276s } passed its deadline 962.276s (passed: 0.514s): Illegal state (yb/consensus/consensus.cc:113): Not the leader (tablet server error 15), state: kRunning
ts2|pid10692|:16130 I1102 17:57:58.316251 11152 transaction_coordinator.cc:960] T 46b21bb306144b82b906ab8aa0ba8718 P 512e973f7bc743c2965e663c455fd54b: GetStatus: 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140 => { status: ABORTED status_time: <max> }
m2|pid10494|:20062 I1102 17:57:58.654997 10611 meta_cache.cc:815] 0x000000000500e018 -> LookupByIdRpc(tablet: 46b21bb306144b82b906ab8aa0ba8718, num_attempts: 1): Failed, got resp error NOT_THE_LEADER
ts2|pid10692|:16130 I1102 17:57:58.666059 11151 transaction_coordinator.cc:960] T 46b21bb306144b82b906ab8aa0ba8718 P 512e973f7bc743c2965e663c455fd54b: GetStatus: 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140 => { status: ABORTED status_time: <max> }
ts2|pid10692|:16130 I1102 17:57:58.675885 11138 transaction_coordinator.cc:960] T 46b21bb306144b82b906ab8aa0ba8718 P 512e973f7bc743c2965e663c455fd54b: GetStatus: 6872d7c3-b0cb-49fa-b3cd-b77cc0c5b140 => { status: ABORTED status_time: <max> }
14. The leadership went back to TS1:
m3|pid10521|:16434 I1102 17:58:08.900265 10574 cluster_balance.cc:1332] Moving leader of 46b21bb306144b82b906ab8aa0ba8718 from TS 512e973f7bc743c2965e663c455fd54b to cb5289750a074c3a81e1c786da5b855f
m3|pid10521|:16434 I1102 17:58:08.900954 12455 async_rpc_tasks.cc:1101] Async Leader Stepdown (task=0x70f52d8, state=kRunning): Stepping down leader 512e973f7bc743c2965e663c455fd54b for tablet 46b21bb306144b82b906ab8aa0ba8718
ts2|pid10692|:16130 I1102 17:58:08.901890 10792 tablet_service.cc:2651] Received Leader stepdown RPC: tablet_id: "46b21bb306144b82b906ab8aa0ba8718" dest_uuid: "512e973f7bc743c2965e663c455fd54b" new_leader_uuid: "cb5289750a074c3a81e1c786da5b855f"
ts2|pid10692|:16130 I1102 17:58:08.902206 10792 raft_consensus.cc:746] T 46b21bb306144b82b906ab8aa0ba8718 P 512e973f7bc743c2965e663c455fd54b [term 3 LEADER]: Transferring leadership to cb5289750a074c3a81e1c786da5b855f

After this point there were no more errors, and all remaining operations in the test completed with the expected results. The issue is not seen after I reduce the size of these two tables from 30,000 rows to, say, 3,000.
This does seem like a cluster overload scenario, but it would be good to understand why cluster overload manifests as ping-ponging leadership changes for the transaction status tablet. Note that this is a PgRegress test, which means there was just this single long-running operation on one connection at the time of the error.
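For concreteness, the mitigation eventually applied to the test (see the commits below) amounts to a smaller data set. An illustrative SQL sketch follows; the exact reduced row count used in the committed fix is an assumption here:

```sql
-- Illustrative only: shrink the data set so ANALYZE completes well within the
-- 5-second heartbeat window (the committed fix may use a different row count).
INSERT INTO pagg_tab_ml SELECT i % 30, i % 10, to_char(i % 4, 'FM0000') FROM generate_series(0, 2999) i;
ANALYZE pagg_tab_ml;
```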

Complete logs can be found here:
org.yb.pgsql.TestPgRegressPartitions-output-3.txt

@deeps1991 deeps1991 added the area/docdb YugabyteDB core features label Jan 4, 2022
deeps1991 added a commit that referenced this issue Jan 19, 2022
Summary:
The test fails with the following error:
```
*** /tmp/yb_tests__2021-12-12T14_29_06__17676.17249.17116/pgregress_output/expected/yb_pg_partition_aggregate.out   2021-12-12 14:47:10.730866113 +0000
--- /tmp/yb_tests__2021-12-12T14_29_06__17676.17249.17116/pgregress_output/results/yb_pg_partition_aggregate.out    2021-12-12 14:47:10.603854965 +0000
***************
*** 710,715 ****
--- 710,716 ----
  ALTER TABLE pagg_tab_ml ATTACH PARTITION pagg_tab_ml_p3 FOR VALUES FROM (20) TO (30);
  INSERT INTO pagg_tab_ml SELECT i % 30, i % 10, to_char(i % 4, 'FM0000') FROM generate_series(0, 29999) i;
  ANALYZE pagg_tab_ml;
+ ERROR:  Operation failed. Try again: Transaction aborted: b69768c7-0be1-435b-aa86-ce3989f27948
  -- For Parallel Append
  SET max_parallel_workers_per_gather TO 2;
  -- Full aggregation at level 1 as GROUP BY clause matches with PARTITION KEY
```

This is because of the large number of rows (30,000) in the pagg_tab_ml.
This also happens for the table pagg_tab_para which also contains 30,000 rows.

GHI [[ #10989 | #10989 ]] tracks why the large number of rows leads to transaction abort,
but as part of this fix, the number of rows for this table is reduced. Further
flakiness occurs because the test times out at 1800 seconds (irrespective of the value of
getTestMethodTimeout()), which is the timeout for
process_tree_supervisor set in common-test-env.sh.

Since the Partitions test is now extremely long-running, this patch breaks up the
partitions schedule into smaller and more manageable schedules.

Running multiple schedules from the same PgRegressTest was causing a
conflict while creating the pg_regress output dir. Hence, this patch creates
a new subdirectory for the outputdir using the schedule's name.

Test Plan: ybd --scb --sj --java-test org.yb.pgsql.TestPgRegressPartitions'

Reviewers: alex

Reviewed By: alex

Subscribers: kannan, yql

Differential Revision: https://phabricator.dev.yugabyte.com/D14613
deeps1991 added a commit that referenced this issue Feb 5, 2022
…ggregate

Summary:
The test fails with the following error:
```
*** /tmp/yb_tests__2021-12-12T14_29_06__17676.17249.17116/pgregress_output/expected/yb_pg_partition_aggregate.out   2021-12-12 14:47:10.730866113 +0000
--- /tmp/yb_tests__2021-12-12T14_29_06__17676.17249.17116/pgregress_output/results/yb_pg_partition_aggregate.out    2021-12-12 14:47:10.603854965 +0000
***************
*** 710,715 ****
--- 710,716 ----
  ALTER TABLE pagg_tab_ml ATTACH PARTITION pagg_tab_ml_p3 FOR VALUES FROM (20) TO (30);
  INSERT INTO pagg_tab_ml SELECT i % 30, i % 10, to_char(i % 4, 'FM0000') FROM generate_series(0, 29999) i;
  ANALYZE pagg_tab_ml;
+ ERROR:  Operation failed. Try again: Transaction aborted: b69768c7-0be1-435b-aa86-ce3989f27948
  -- For Parallel Append
  SET max_parallel_workers_per_gather TO 2;
  -- Full aggregation at level 1 as GROUP BY clause matches with PARTITION KEY
```

This is because of the large number of rows (30,000) in the pagg_tab_ml.
This also happens for the table pagg_tab_para which also contains 30,000 rows.

GHI [[ #10989 | #10989 ]] tracks why the large number of rows leads to transaction abort,
but as part of this fix, the number of rows for this table is reduced. Further
flakiness occurs because the test times out at 1800 seconds (irrespective of the value of
getTestMethodTimeout()), which is the timeout for
process_tree_supervisor set in common-test-env.sh.

Since the Partitions test is now extremely long-running, this patch breaks up the
partitions schedule into smaller and more manageable schedules.

Running multiple schedules from the same PgRegressTest was causing a
conflict while creating the pg_regress output dir. Hence, this patch creates
a new subdirectory for the outputdir using the schedule's name.

Original diff: https://phabricator.dev.yugabyte.com/D14613
Original commit: ccabdf7

Test Plan: ybd --scb --sj --java-test org.yb.pgsql.TestPgRegressPartitions'

Reviewers: alex, myang

Reviewed By: myang

Subscribers: myang, yql

Differential Revision: https://phabricator.dev.yugabyte.com/D15257
deeps1991 added a commit that referenced this issue Mar 16, 2022
…gregate

Summary:
The test fails with the following error:
```
*** /tmp/yb_tests__2021-12-12T14_29_06__17676.17249.17116/pgregress_output/expected/yb_pg_partition_aggregate.out   2021-12-12 14:47:10.730866113 +0000
--- /tmp/yb_tests__2021-12-12T14_29_06__17676.17249.17116/pgregress_output/results/yb_pg_partition_aggregate.out    2021-12-12 14:47:10.603854965 +0000
***************
*** 710,715 ****
--- 710,716 ----
  ALTER TABLE pagg_tab_ml ATTACH PARTITION pagg_tab_ml_p3 FOR VALUES FROM (20) TO (30);
  INSERT INTO pagg_tab_ml SELECT i % 30, i % 10, to_char(i % 4, 'FM0000') FROM generate_series(0, 29999) i;
  ANALYZE pagg_tab_ml;
+ ERROR:  Operation failed. Try again: Transaction aborted: b69768c7-0be1-435b-aa86-ce3989f27948
  -- For Parallel Append
  SET max_parallel_workers_per_gather TO 2;
  -- Full aggregation at level 1 as GROUP BY clause matches with PARTITION KEY
```

This is because of the large number of rows (30,000) in the pagg_tab_ml.
This also happens for the table pagg_tab_para which also contains 30,000 rows.

GHI [[ #10989 | #10989 ]] tracks why the large number of rows leads to transaction abort,
but as part of this fix, the number of rows for this table is reduced. Further
flakiness occurs because the test times out at 1800 seconds (irrespective of the value of
getTestMethodTimeout()), which is the timeout for
process_tree_supervisor set in common-test-env.sh.

Since the Partitions test is now extremely long-running, this patch breaks up the
partitions schedule into smaller and more manageable schedules.

Running multiple schedules from the same PgRegressTest was causing a
conflict while creating the pg_regress output dir. Hence, this patch creates
a new subdirectory for the outputdir using the schedule's name.

Original diff: https://phabricator.dev.yugabyte.com/D14613
Original commit: ccabdf7

Test Plan:
Jenkins: rebase: 2.8
ybd --scb --sj --java-test org.yb.pgsql.TestPgRegressPartitions'

Reviewers: alex

Reviewed By: alex

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D15821