[YSQL] tserver core dump occurs in Postgres when a transaction is active and the connection to the tserver is unavailable #18192

Closed
1 task done
qvad opened this issue Jul 12, 2023 · 12 comments
Assignees
Labels
  • 2.18 Backport Required
  • 2.18.9_blocker
  • 2.20 Backport Required
  • 2.20.6_blocker
  • 2024.1 Backport Required
  • 2024.1.2_blocker
  • area/ysql: Yugabyte SQL (YSQL)
  • blocks_automation: Issues marked with this label are blocking QA automation and need developer attention asap.
  • kind/bug: This issue is a bug
  • priority/high: High Priority
  • qa_stress: Bugs identified via Stress automation

Comments

@qvad
Contributor

qvad commented Jul 12, 2023

Jira Link: DB-7215

Description

We have observed these panics when a transaction is active in the following scenarios:

  • The PostgreSQL process is terminated, and the messengers are shut down
  • The tserver terminates while the PostgreSQL process is trying to communicate with it
  • The tserver is unresponsive and a PostgresService heartbeat fails to be processed in a timely manner

This issue can be reproduced by cancelling a DDL transaction with pg_terminate_backend() before it completes.
In addition, it appears to occur in normal transactions when AbortSubtransaction is called and the PostgreSQL process is unable to communicate with the tserver.

Original test which reproduced this issue:

The scenario is a bank workload with large transactions and wait-on-conflict usage.
In parallel, we restart AWS nodes.

The test failed within 20 minutes with a tserver core dump on 2.19.1.0-b168.
Version 2.19.1.0-b1 fails with a postgres core dump after 2-3 hours.

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@qvad qvad added area/ysql Yugabyte SQL (YSQL) status/awaiting-triage Issue awaiting triage labels Jul 12, 2023
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Jul 12, 2023
@Karvy-yb

The test fails in 20 min with postgres crash on 2.19.1.0-b363
cc: @robertsami

@rthallamko3 rthallamko3 added the area/docdb YugabyteDB core features label Jul 25, 2023
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Jul 25, 2023
@robertsami
Contributor

logs showing:

2023-07-12 05:18:38.033 UTC [1693] WARNING:  AbortTransaction while in ABORT state

from

elog(WARNING, "AbortTransaction while in %s state",

@robertsami
Contributor

looks like we update this state to TRANS_ABORT in two places:


@robertsami robertsami assigned tvesely and unassigned robertsami Jul 26, 2023
@yugabyte-ci yugabyte-ci added priority/high High Priority and removed priority/medium Medium priority issue area/docdb YugabyteDB core features labels Jul 31, 2023
@Karvy-yb

I see the same AbortTransaction issue in test_intensive_multi_tenancy_workload for version 2.19.1.0-b363

@Karvy-yb

Observed a SIGABRT in test_create_alter_delete_tables_vm_restarts on versions 2.19.1.0-b379 and 2.19.1.0-b389

cc: @qvad

@robertsami robertsami changed the title [YSQL] tserver core dump occurs in bank workload pessimistic locking scenario with node restarts [YSQL] tserver core dump occurs in postgres when terminating a backend in DDL mode Jul 31, 2023
@tvesely
Contributor

tvesely commented Jul 31, 2023

This issue occurs when cancelling a DDL transaction with pg_terminate_backend() before it is complete.

{"level":"info","ts":1690844645.7892952,"logger":"ddl_panic","caller":"util/DDLPanic.go:117","msg":"running CREATE","host":"127.0.0.1","port":5433,"user":"postgres","database":"postgres","ssl":false,"backend_id": 1175710}
{"level":"info","ts":1690844645.881921,"logger":"ddl_panic","caller":"util/DDLPanic.go:141","msg":"killing session 1175710","host":"127.0.0.1","port":5433,"user":"postgres","database":"postgres","ssl":false}

PID 1175710 was cancelled with pg_terminate_backend() in the middle of creating a table, and this results in a PANIC.

I0731 16:04:05.835775 1175710 ybccmds.c:527] Creating Table postgres.public.foo
I0731 16:04:06.073963 1175710 pg_txn_manager.cc:384] ExitSeparateDdlTxnMode: { ddl_type: DdlWithDocdbSchemaChanges read_only: 0 deferrable: 0 txn_in_progress: 1 pg_isolation_level: READ_COMMITTED isolation_level: 0 }; query: { create table if not exists foo(a int primary key, b int); }; 
I0731 16:04:06.074873 1175710 pg_txn_manager.cc:239] CalculateIsolation: { ddl_type: DdlWithDocdbSchemaChanges read_only: 0 deferrable: 0 txn_in_progress: 1 pg_isolation_level: READ_COMMITTED isolation_level: 0 }; query: { create table if not exists foo(a int primary key, b int); }; 
2023-07-31 16:04:06.075 PDT [1175710] ERROR:  Shutdown connection
	/home/dreddor/code/yugabyte-db/build/debug-clang16-dynamic-ninja/../../src/yb/yql/pggate/util/ybc_util.cc:331:     @     0x7fa4a2eb6d0b  YBCGetStackTrace
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/utils/error/../../../../../../../src/postgres/src/backend/utils/error/elog.c:4781:     @     0x55712667a428  yb_errmsg_from_status_data
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/catalog/yb_catalog/../../../../../../../src/postgres/src/backend/catalog/yb_catalog/yb_catalog_version.c:464:     @     0x557125fd4657  YbGetMasterCatalogVersionFromTable
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/catalog/yb_catalog/../../../../../../../src/postgres/src/backend/catalog/yb_catalog/yb_catalog_version.c:58:     @     0x557125fd352b  YbGetMasterCatalogVersion
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/postgres.c:3868:     @     0x557126479006  YBPrepareCacheRefreshIfNeeded
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/postgres.c:5360:     @     0x557126477c12  PostgresMain
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4658:     @     0x5571263a1598  BackendRun
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4296:     @     0x5571263a0546  BackendStartup
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1775:     @     0x55712639f01e  ServerLoop
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1431:     @     0x55712639be1a  PostmasterMain
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/main/../../../../../../src/postgres/src/backend/main/main.c:234:     @     0x5571262995d7  PostgresServerProcessMain
	    @     0x557126299b91 
	../csu/libc-start.c:308:                                                                                @     0x7fa4a2a62082  __libc_start_main
	    @     0x557125e80c6d 
	
2023-07-31 16:04:06.075 PDT [1175710] STATEMENT:  create table if not exists foo(a int primary key, b int);
I0731 16:04:06.075604 1175710 pg_txn_manager.cc:384] ExitSeparateDdlTxnMode: { ddl_type: DdlWithDocdbSchemaChanges read_only: 0 deferrable: 0 txn_in_progress: 1 pg_isolation_level: READ_COMMITTED isolation_level: 0 }; query: { No query }; 
2023-07-31 16:04:06.076 PDT [1175710] ERROR:  Shutdown connection
	/home/dreddor/code/yugabyte-db/build/debug-clang16-dynamic-ninja/../../src/yb/yql/pggate/util/ybc_util.cc:331:     @     0x7fa4a2eb6d0b  YBCGetStackTrace
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/utils/error/../../../../../../../src/postgres/src/backend/utils/error/elog.c:4781:     @     0x55712667a428  yb_errmsg_from_status_data
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/utils/misc/../../../../../../../src/postgres/src/backend/utils/misc/pg_yb_utils.c:702:     @     0x5571266b85e3  YBCAbortTransaction
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/access/transam/../../../../../../../src/postgres/src/backend/access/transam/xact.c:2852:     @     0x557125f827fa  AbortTransaction
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/access/transam/../../../../../../../src/postgres/src/backend/access/transam/xact.c:3336:     @     0x557125f83e7b  AbortCurrentTransaction
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/postgres.c:5119:     @     0x557126477649  PostgresMain
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4658:     @     0x5571263a1598  BackendRun
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4296:     @     0x5571263a0546  BackendStartup
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1775:     @     0x55712639f01e  ServerLoop
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1431:     @     0x55712639be1a  PostmasterMain
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/main/../../../../../../src/postgres/src/backend/main/main.c:234:     @     0x5571262995d7  PostgresServerProcessMain
	    @     0x557126299b91 
	../csu/libc-start.c:308:                                                                                @     0x7fa4a2a62082  __libc_start_main
	    @     0x557125e80c6d 
	
2023-07-31 16:04:06.076 PDT [1175710] WARNING:  AbortTransaction while in ABORT state
	/home/dreddor/code/yugabyte-db/build/debug-clang16-dynamic-ninja/../../src/yb/yql/pggate/util/ybc_util.cc:331:     @     0x7fa4a2eb6d0b  YBCGetStackTrace
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/utils/error/../../../../../../../src/postgres/src/backend/utils/error/elog.c:1748:     @     0x5571266778db  elog_finish
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/access/transam/../../../../../../../src/postgres/src/backend/access/transam/xact.c:2743:     @     0x557125f8260d  AbortTransaction
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/access/transam/../../../../../../../src/postgres/src/backend/access/transam/xact.c:3336:     @     0x557125f83e7b  AbortCurrentTransaction
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/postgres.c:5119:     @     0x557126477649  PostgresMain
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4658:     @     0x5571263a1598  BackendRun
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4296:     @     0x5571263a0546  BackendStartup
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1775:     @     0x55712639f01e  ServerLoop
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1431:     @     0x55712639be1a  PostmasterMain
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/main/../../../../../../src/postgres/src/backend/main/main.c:234:     @     0x5571262995d7  PostgresServerProcessMain
	    @     0x557126299b91 
	../csu/libc-start.c:308:                                                                                @     0x7fa4a2a62082  __libc_start_main
	    @     0x557125e80c6d 
	
I0731 16:04:06.076644 1175710 pg_txn_manager.cc:384] ExitSeparateDdlTxnMode: { ddl_type: DdlWithDocdbSchemaChanges read_only: 0 deferrable: 0 txn_in_progress: 1 pg_isolation_level: READ_COMMITTED isolation_level: 0 }; query: { No query }; 
2023-07-31 16:04:06.077 PDT [1175710] ERROR:  Shutdown connection
	/home/dreddor/code/yugabyte-db/build/debug-clang16-dynamic-ninja/../../src/yb/yql/pggate/util/ybc_util.cc:331:     @     0x7fa4a2eb6d0b  YBCGetStackTrace
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/utils/error/../../../../../../../src/postgres/src/backend/utils/error/elog.c:4781:     @     0x55712667a428  yb_errmsg_from_status_data
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/utils/misc/../../../../../../../src/postgres/src/backend/utils/misc/pg_yb_utils.c:702:     @     0x5571266b85e3  YBCAbortTransaction
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/access/transam/../../../../../../../src/postgres/src/backend/access/transam/xact.c:2852:     @     0x557125f827fa  AbortTransaction
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/access/transam/../../../../../../../src/postgres/src/backend/access/transam/xact.c:3336:     @     0x557125f83e7b  AbortCurrentTransaction
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/postgres.c:5119:     @     0x557126477649  PostgresMain
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4658:     @     0x5571263a1598  BackendRun
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4296:     @     0x5571263a0546  BackendStartup
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1775:     @     0x55712639f01e  ServerLoop
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1431:     @     0x55712639be1a  PostmasterMain
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/main/../../../../../../src/postgres/src/backend/main/main.c:234:     @     0x5571262995d7  PostgresServerProcessMain
	    @     0x557126299b91 
	../csu/libc-start.c:308:                                                                                @     0x7fa4a2a62082  __libc_start_main
	    @     0x557125e80c6d 
	
2023-07-31 16:04:06.077 PDT [1175710] PANIC:  ERRORDATA_STACK_SIZE exceeded
	/home/dreddor/code/yugabyte-db/build/debug-clang16-dynamic-ninja/../../src/yb/yql/pggate/util/ybc_util.cc:331:     @     0x7fa4a2eb6d0b  YBCGetStackTrace
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/utils/error/../../../../../../../src/postgres/src/backend/utils/error/elog.c:1147:     @     0x55712667252a  errmsg_internal
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/utils/error/../../../../../../../src/postgres/src/backend/utils/error/elog.c:1698:     @     0x557126677557  elog_start
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/access/transam/../../../../../../../src/postgres/src/backend/access/transam/xact.c:2743:     @     0x557125f825eb  AbortTransaction
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/access/transam/../../../../../../../src/postgres/src/backend/access/transam/xact.c:3336:     @     0x557125f83e7b  AbortCurrentTransaction
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/postgres.c:5119:     @     0x557126477649  PostgresMain
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4658:     @     0x5571263a1598  BackendRun
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4296:     @     0x5571263a0546  BackendStartup
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1775:     @     0x55712639f01e  ServerLoop
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1431:     @     0x55712639be1a  PostmasterMain
	/home/dreddor/code/yugabyte-db/src/postgres/src/backend/main/../../../../../../src/postgres/src/backend/main/main.c:234:     @     0x5571262995d7  PostgresServerProcessMain
	    @     0x557126299b91 
	../csu/libc-start.c:308:                                                                                @     0x7fa4a2a62082  __libc_start_main
	    @     0x557125e80c6d 
	
I0731 16:04:06.364205 1175610 postmaster.c:2950] cleaning up after process with pid 1175710 exited with status 134
2023-07-31 16:04:06.364 PDT [1175610] WARNING:  server process (PID 1175710) was terminated by signal 6: Aborted

When the signal handler runs during pg_terminate_backend(), the pggate client connection to the tserver is shut down. Later, during error recovery, ExitSeparateDdlTxnMode() is called and tries to cancel the DDL transaction. Cancelling requires that same client connection, which is already closed, so the call fails with ERROR: Shutdown connection.

Because this failure happens during error recovery, Postgres starts error recovery again and again tries to cancel the DDL transaction. Each attempt fails the same way, so the loop never makes progress, and the Postgres backend eventually terminates with PANIC: ERRORDATA_STACK_SIZE exceeded.
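
The loop described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the Postgres implementation: the `Backend` class, its methods, and the "statement failed" message are hypothetical, and the stack size of 5 is the value quoted in the commit summary later in this thread. The exact overflow point and the number of messages in a real log differ from this model.

```python
ERRORDATA_STACK_SIZE = 5  # size quoted in the commit summary later in this thread


class Panic(Exception):
    """Models a PANIC: the backend aborts and leaves a core dump."""


class Backend:
    """Hypothetical, heavily simplified model of one Postgres backend."""

    def __init__(self, connected):
        self.connected = connected  # is the connection to the tserver usable?
        self.error_depth = 0        # error records currently on the stack
        self.log = []

    def report_error(self, message):
        # Pushing onto an already full error stack is modelled as a PANIC.
        if self.error_depth >= ERRORDATA_STACK_SIZE:
            self.log.append("PANIC: ERRORDATA_STACK_SIZE exceeded")
            raise Panic()
        self.error_depth += 1
        self.log.append(f"ERROR: {message}")

    def abort_ddl_transaction(self):
        # Aborting needs an RPC to the tserver; return whether it could be sent.
        return self.connected

    def recover(self):
        # A failed abort is itself an error, which re-enters error recovery.
        while not self.abort_ddl_transaction():
            self.report_error("Shutdown connection")
        self.error_depth = 0  # recovery succeeded, error records are released


def simulate(connected):
    backend = Backend(connected)
    try:
        backend.report_error("statement failed")  # the original error
        backend.recover()
    except Panic:
        pass
    return backend.log
```

With a working connection, `simulate(True)` logs only the original error. With the connection shut down, `simulate(False)` keeps logging `ERROR: Shutdown connection` until the modelled stack is full and then ends with the PANIC line.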

@sushantrmishra

Possible duplicate of #17172

@yugabyte-ci yugabyte-ci added priority/medium Medium priority issue and removed priority/high High Priority priority/medium Medium priority issue labels Nov 16, 2023
@karthik-ramanathan-3006
Contributor

fcbdb09 addresses issues observed around the retryability of aborting transactions.
To completely fix this issue, an additional fix is needed to handle subtransactions that encounter an error.
That fix is currently in progress.

svarnau pushed a commit that referenced this issue May 25, 2024
…action

Summary:
**Background**
Postgres has a small stack (of size 5) to hold error records in re-entrant error scenarios.
When an error occurs during the execution of a transaction, the error is pushed onto this stack, and Postgres attempts transaction error recovery.
As part of this recovery, any active transaction and subtransaction is aborted. If an error occurs during these abort operations, it is also pushed onto the stack,
and recovery from the failed recovery is attempted, leading to a recursive loop.
In YugabyteDB, aborting a transaction requires an RPC to the local tserver, which introduces additional failure modes.
Failure to communicate with the tserver during transaction error recovery can cause this recursive loop of errors to overflow the error stack and result in a PANIC.
This PANIC is innocuous, because DocDB automatically aborts the transaction after a period of inactivity.

**Fix**
This revision makes the AbortTransaction flow a best-effort approach so that errors from this flow are handled and not propagated further.
The flow is as follows:
 - If a DML transaction is to be aborted via `YBCAbortTransaction` (the enclosing DDL transaction is also aborted):
   - Two FinishTransaction RPCs with commit = false are sent to the tserver, first for the DDL, second for the DML.
   - Irrespective of the success/failure of the RPCs, the transaction state in pggate is cleared.
   - The status of the RPC is propagated to the pg layer.
   - In case of any errors, pg closes the backend connection.

The above flow is also used as part of PG error recovery to abort any ongoing transaction (DDL or DML or both) and clear any transaction state via `YBCAbortTransaction`.

 - If a DDL transaction is to be aborted via `YBResetDdlState` (i.e., the enclosing DML transaction does not need to be aborted):
   - A FinishTransaction RPC with commit = false is sent to the local tserver to abort the DDL transaction.
   - Irrespective of the success/failure of the above RPC, the local DDL transaction state in pggate is cleared.
   - The status of the RPC is propagated to the pg layer.
   - In case of any errors, the PG error recovery flow is invoked to abort any enclosing DML transaction.
Jira: DB-7215
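
The two flows listed above can be sketched roughly as follows. This is a minimal model, not the pggate code: `Status`, `TxnState`, `pg_abort`, and `pg_reset_ddl` are hypothetical names, the RPC is modelled as a plain callable, and sending the second RPC even when the first one fails is an assumption the summary does not confirm.

```python
from dataclasses import dataclass


@dataclass
class Status:
    ok: bool
    message: str = ""


class TxnState:
    """Hypothetical stand-in for the transaction state kept in pggate."""

    def __init__(self, finish_transaction):
        # finish_transaction(kind, commit) -> Status models the FinishTransaction RPC.
        self.finish_transaction = finish_transaction
        self.ddl_active = False
        self.dml_active = False

    def abort_transaction(self):
        """Models the YBCAbortTransaction flow: DDL first, then DML."""
        statuses = [
            self.finish_transaction("DDL", commit=False),
            self.finish_transaction("DML", commit=False),
        ]
        # Best effort: local state is cleared whether or not the RPCs succeeded.
        self.ddl_active = False
        self.dml_active = False
        # Propagate the first failure, if any, to the caller.
        return next((s for s in statuses if not s.ok), Status(ok=True))

    def reset_ddl_state(self):
        """Models the YBResetDdlState flow: only the DDL transaction is aborted."""
        status = self.finish_transaction("DDL", commit=False)
        self.ddl_active = False
        return status


def pg_abort(state):
    # pg layer: an error from the abort closes the backend connection.
    return "ok" if state.abort_transaction().ok else "close backend connection"


def pg_reset_ddl(state):
    # pg layer: an error here falls back to the general error recovery flow.
    return "ok" if state.reset_ddl_state().ok else pg_abort(state)
```

The point of the sketch is that neither flow retries: a failed RPC is reported once, local state is already clean, and the caller decides what to do with the status.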

Test Plan:
This revision does not introduce new functionality; it only simplifies existing flows.
Testing against a Jenkins run should be sufficient.

Reviewers: pjain

Reviewed By: pjain

Subscribers: smishra, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34725
karthik-ramanathan-3006 added a commit that referenced this issue Jun 21, 2024
Summary:
D34725 introduced a best-effort approach to aborting transactions in order to prevent an error stack overflow in case of repeated failures.
This revision extends the same behavior to the abort of subtransactions: if a failure is detected while rolling back to a subtransaction, the backend
connection is terminated. This approach is preferred over handling and propagating the error further because of its simplicity.
It also helps end users: the previous behavior produced a core dump (as a result of a PANIC from the error stack overflow), which
raised a system alert and engaged Support teams for what is an innocuous error. This revision changes the core dump to a FATAL log message.
Jira: DB-7215
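
As a rough illustration of the behavior change for subtransactions, the sketch below models a failed rollback ending the session with a single FATAL message instead of re-entering error recovery. All names (`FatalError`, `rollback_to_subtransaction`, `run_backend`) and the message texts are hypothetical and only stand in for the real Postgres and pggate code paths.

```python
class FatalError(Exception):
    """Models a FATAL report: the session ends, no further recovery is attempted."""


def rollback_to_subtransaction(send_rollback_rpc, subtxn_id):
    # send_rollback_rpc is a hypothetical callable returning True on success.
    if not send_rollback_rpc(subtxn_id):
        raise FatalError(f"could not roll back to subtransaction {subtxn_id}")


def run_backend(send_rollback_rpc, subtxn_id=2):
    log = []
    try:
        rollback_to_subtransaction(send_rollback_rpc, subtxn_id)
        log.append(f"LOG: rolled back to subtransaction {subtxn_id}")
    except FatalError as exc:
        # One FATAL line and a closed connection, instead of a PANIC and a core dump.
        log.append(f"FATAL: {exc}")
        log.append("connection terminated")
    return log
```

For example, `run_backend(lambda subtxn_id: False)` models an unreachable tserver and produces a FATAL line followed by the connection being closed.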

Test Plan:
Manual testing.
Unit tests to follow in a separate revision.

Reviewers: pjain

Reviewed By: pjain

Subscribers: smishra, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D35757
karthik-ramanathan-3006 added a commit that referenced this issue Jun 24, 2024
…h for Abort Transaction

Summary:
Original commit: 1884090 / D34725
Jira: DB-7215

Test Plan:
This revision does not introduce new functionality; it only simplifies existing flows.
Testing against a Jenkins run should be sufficient.

Reviewers: pjain

Reviewed By: pjain

Subscribers: yql, smishra

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36015
karthik-ramanathan-3006 added a commit that referenced this issue Jun 24, 2024
…for Abort Transaction

Summary:
Original commit: 1884090 / D34725
Jira: DB-7215

Test Plan:
This revision does not introduce new functionality; it only simplifies existing flows.
Testing against a Jenkins run should be sufficient.

Reviewers: pjain

Reviewed By: pjain

Subscribers: smishra, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36016
karthik-ramanathan-3006 added a commit to karthik-ramanathan-3006/yugabyte-db that referenced this issue Jun 24, 2024
karthik-ramanathan-3006 added a commit that referenced this issue Jul 1, 2024
…uster tests

Summary:
This revision introduces the ability to mock tserver responses to pggate RPCs in pg_client_service.
The goal is to be able to deterministically test hard-to-reproduce failure modes between pggate and the tserver by adding mocks.
As an example, it is now possible to emulate scenarios such as "Introduce network failure for FinishTransaction RPCs in Session X after successful completion of CreateTable RPC", which would previously have required tinkering with a lot of gflags and concurrency constructs.

All RPCs in `src/yb/tserver/pg_client.proto` are now mock-able.
Jira: DB-7215
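
The idea of per-session, per-RPC mocking can be sketched as below. This is not the interface added by the revision: `MockablePgClientService`, `mock_response`, `handle`, and the string responses are hypothetical, and only the RPC names `CreateTable` and `FinishTransaction` come from the summary above. The scenario mirrors the quoted example.

```python
class MockablePgClientService:
    """Hypothetical model of a service whose RPC responses can be overridden."""

    def __init__(self):
        self._mocks = {}  # (session_id, rpc_name) -> canned response
        self.calls = []   # every RPC received, in order

    def mock_response(self, session_id, rpc_name, response):
        self._mocks[(session_id, rpc_name)] = response

    def handle(self, session_id, rpc_name):
        self.calls.append((session_id, rpc_name))
        # A registered mock wins; otherwise pretend the real handler succeeded.
        return self._mocks.get((session_id, rpc_name), "OK")


def run_scenario():
    service = MockablePgClientService()
    session_x, other_session = 1, 2
    results = [service.handle(session_x, "CreateTable")]
    # Only after CreateTable has succeeded, fail FinishTransaction for session X.
    if results[0] == "OK":
        service.mock_response(session_x, "FinishTransaction", "NetworkError")
    results.append(service.handle(session_x, "FinishTransaction"))
    results.append(service.handle(other_session, "FinishTransaction"))
    return results
```

Because the mock is keyed by session and RPC name, the failure is deterministic for session X while other sessions keep getting normal responses.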

Test Plan:
Run the following sample test:
```
./yb_build.sh --cxx-test pgwrapper_pg_mini-test --gtest-filter 'PgRecursiveAbortTest.AbortAfterTserverShutdown'
```

Reviewers: dmitry, pjain

Reviewed By: dmitry

Subscribers: ybase, pjain, smishra, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34698
jasonyb pushed a commit that referenced this issue Jul 2, 2024
Summary:
 f8e73e9 [#18192] YSQL: Introduce interface to mock tserver response in MiniCluster tests
 4ae68f4 Build break fix for centos7
 Excluded: 2ec9224 [#23033] Allow running YSQL upgrade unit tests with snapshot other than 2.0.9.0
 37912f1 [#22058] docdb: Disable connections on cloned db until cloning is complete
 059b855 [#22908] xCluster: Use XClusterRemoteClient across XCluster
 5dc5ee7 [#22849] YSQL: Correctly handle reset phase timeout errors in YSQL Connection Manager
 af49a1e [#22876][#22835][#22773] CDCSDK: Add new auto flag to identify non-eligible tables in CDC stream
 f3c4e14 [PLAT-14524] Up-version pekko to fix TLSActor infinite loop
 9388aea [#23052] yugabyted:  Restarting a node fails when data_dir is missing in user specified configuration.
 5cf9736 [PLAT-12685]: Generate a YBA metric for xcluster config table status.
 73fc90a [PLAT-14497]: Fix incremental backup time when none full backup exists
 e9b5ba5 [PLAT-14533]: Modify the gflags metadata support db version check
 8dca952 [PLAT-14432][Platform] Show certificate Database Node Certificate/key and Client Certificate/key for CA certs in certificate details modal
 6551e45 Add utkarsh.munjal to contributors.md
 bafa1cb [#21751] YSQL, ASH: Sampling of wait events

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, tfoucher

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36325
karthik-ramanathan-3006 added a commit that referenced this issue Jul 17, 2024
…esponse in MiniCluster tests

Summary:
Original commit: f8e73e9 / D34698
Jira: DB-7215

Test Plan:
Run the following sample test:
```
./yb_build.sh --cxx-test pgwrapper_pg_mini-test --gtest-filter 'PgRecursiveAbortTest.AbortAfterTserverShutdown'
```

Reviewers: dmitry, pjain

Reviewed By: dmitry

Subscribers: yql, smishra, pjain, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36278
karthik-ramanathan-3006 added a commit that referenced this issue Jul 23, 2024
…ponse in MiniCluster tests

Summary:
Original commit: f8e73e9 / D34698
Jira: DB-7215

Test Plan:
Run the following sample test:
```
./yb_build.sh --cxx-test pgwrapper_pg_mini-test --gtest-filter 'PgRecursiveAbortTest.AbortAfterTserverShutdown'
```

Reviewers: dmitry, pjain

Reviewed By: dmitry

Subscribers: yql, smishra, pjain, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36283
karthik-ramanathan-3006 added a commit that referenced this issue Jul 24, 2024
… failure.

Summary:
Original commit: 1d07a89 / D35757
D34725 introduced a best-effort approach to aborting transactions in order to prevent an error stack overflow in case of repeated failures.
This revision extends the same behavior to the abort of subtransactions: if a failure is detected while rolling back to a subtransaction, the backend
connection is terminated. This approach is preferred to handling and propagating the error further because of its simplicity.
This is helpful from an end user's perspective, as the previous approach produced a core dump (as a result of a PANIC from stack overflow), which
raised a system alert and engaged Support teams for what is an innocuous error. This revision changes the core dump to a FATAL log message.
Jira: DB-7215

Test Plan:
Manual testing.
Unit tests to follow in a separate revision.

Reviewers: pjain

Reviewed By: pjain

Subscribers: yql, smishra

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36785
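
The difference between the two behaviors described in the summary above can be illustrated with a toy model. This is a minimal Python sketch under stated assumptions, not the actual PostgreSQL or YugabyteDB code: the `Backend` class, its methods, and the `ERROR_STACK_LIMIT` value are hypothetical and chosen only to show why a failing abort can recurse until the error stack overflows, and how terminating the connection on the first failure avoids that.

```python
ERROR_STACK_LIMIT = 5  # illustrative limit on nested errors (assumption)


class Panic(Exception):
    """Models a PANIC, which would produce a core dump."""


class Fatal(Exception):
    """Models a FATAL error, which only terminates the backend connection."""


class Backend:
    def __init__(self, tserver_reachable: bool, best_effort: bool) -> None:
        self.tserver_reachable = tserver_reachable
        self.best_effort = best_effort
        self.error_depth = 0

    def _send_abort_rpc(self) -> None:
        if not self.tserver_reachable:
            raise ConnectionError("tserver unavailable")

    def abort_subtransaction(self) -> None:
        try:
            self._send_abort_rpc()
        except ConnectionError as exc:
            if self.best_effort:
                # New behavior: give up immediately and drop the connection.
                raise Fatal("abort failed, terminating connection") from exc
            # Old behavior: report an error, which triggers another abort.
            self.report_error(exc)

    def report_error(self, exc: Exception) -> None:
        self.error_depth += 1
        if self.error_depth > ERROR_STACK_LIMIT:
            raise Panic("error stack overflow") from exc
        # Error recovery rolls back again, and the rollback fails again.
        self.abort_subtransaction()


for best_effort in (False, True):
    try:
        Backend(tserver_reachable=False, best_effort=best_effort).abort_subtransaction()
    except (Panic, Fatal) as outcome:
        print(f"best_effort={best_effort}: {type(outcome).__name__}: {outcome}")
```

In this model the old path ends in `Panic` after repeated nested failures, while the best-effort path ends in `Fatal` on the first failure, mirroring the change from a core dump to a FATAL log message.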
karthik-ramanathan-3006 added a commit that referenced this issue Jul 24, 2024
…ailure.

Summary:
Original commit: 1d07a89 / D35757
D34725 introduced a best-effort approach to aborting transactions in order to prevent an error stack overflow in case of repeated failures.
This revision extends the same behavior to the abort of subtransactions: if a failure is detected while rolling back to a subtransaction, the backend
connection is terminated. This approach is preferred to handling and propagating the error further because of its simplicity.
This is helpful from an end user's perspective, as the previous approach produced a core dump (as a result of a PANIC from stack overflow), which
raised a system alert and engaged Support teams for what is an innocuous error. This revision changes the core dump to a FATAL log message.
Jira: DB-7215

Test Plan:
Manual testing.
Unit tests to follow in a separate revision.

Reviewers: pjain

Reviewed By: pjain

Subscribers: smishra, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36786
@karthik-ramanathan-3006

Current status:

| Code change description | master | 2024.1.1 | 2024.1 | 2.20 | 2.18 and beyond |
| --- | --- | --- | --- | --- | --- |
| Fix for failed ABORT Transaction | Merged | Merged | Merged | Merged | Not planned |
| Fix for failed ABORT SubTransaction | Merged | Not available | Merged | Merged | Not planned |
| Interface for testing failures | Merged | Not available | Merged | Merged | Not planned |

@github-project-automation github-project-automation bot moved this from In Review to Done in Wait-Queue Based Locking Aug 8, 2024