[YSQL] tserver core dump occurs in Postgres when a transaction is active and the connection to the tserver is unavailable #18192
Comments
The test fails in 20 min with postgres crash on
logs showing:
from
looks like we update this state to
I see the same AbortTransaction issue in
Observed cc: @qvad
This issue occurs when cancelling a DDL transaction with
{"level":"info","ts":1690844645.7892952,"logger":"ddl_panic","caller":"util/DDLPanic.go:117","msg":"running CREATE","host":"127.0.0.1","port":5433,"user":"postgres","database":"postgres","ssl":false,"backend_id": 1175710}
{"level":"info","ts":1690844645.881921,"logger":"ddl_panic","caller":"util/DDLPanic.go:141","msg":"killing session 1175710","host":"127.0.0.1","port":5433,"user":"postgres","database":"postgres","ssl":false}
PID
When the signal handler is called during
Because this happens during error recovery, it attempts to recover the session again, and it again attempts to cancel the DDL transaction. This is an infinite loop, and the Postgres backend eventually terminates with
Possible duplicate of this issue: #17172
fcbdb09 addresses issues observed around the retry-ability of ABORTing transactions.
…action

Summary:

**Background**

Postgres has a small stack (of size 5) to hold error records in re-entrant error scenarios. When any error occurs during the execution of a transaction, the error is pushed onto this stack, and Postgres attempts to perform transaction error recovery. As part of this recovery, any active transaction and sub-transaction is aborted. If an error occurs during these abort operations, it is also pushed onto the stack, and recovery from this error recovery is attempted, leading to a recursive loop.

In YugabyteDB, aborting a transaction requires an RPC to the local tserver, which introduces additional modes of failure. Failure to communicate with the tserver during transaction error recovery can cause this recursive loop of errors to overflow the error stack and result in a PANIC. This PANIC is innocuous, because DocDB automatically aborts the transaction after a period of inactivity.

**Fix**

This revision makes the AbortTransaction flow a best-effort approach so that errors from this flow are handled and not propagated further. The flow is as follows:

- If a DML transaction is sought to be aborted (enclosing DDL transaction will also be aborted) via `YBCAbortTransaction`:
  - Two FinishTransaction RPCs with commit = false are sent to the tserver, first for the DDL, second for the DML.
  - Irrespective of the success/failure of the RPCs, the transaction state in pggate is cleared.
  - The status of the RPC is propagated to the pg layer.
  - In case of any errors, pg closes the backend connection.

The above flow is also used as part of PG error recovery to abort any ongoing transaction (DDL or DML or both) and clear any transaction state via `YBCAbortTransaction`.

- If a DDL transaction is sought to be aborted via `YBResetDdlState` (i.e. enclosing DML transaction does not need to be aborted):
  - A FinishTransaction RPC with commit = false is sent to the local tserver to abort the DDL transaction.
  - Irrespective of the success/failure of the above RPC, the local DDL transaction state in pggate is cleared.
  - The status of the RPC is propagated to the pg layer.
  - In case of any errors, the PG error recovery flow is invoked to abort any enclosing DML transaction.

Jira: DB-7215

Test Plan: This revision does not introduce new functionality; it only simplifies existing flows. Testing against a Jenkins run should be sufficient.

Reviewers: pjain
Reviewed By: pjain
Subscribers: smishra, yql
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D34725
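The failure mode and the fix in this summary can be modelled in a few lines. What follows is a toy Python sketch, not YugabyteDB or PostgreSQL code: `Backend`, `recover_recursive`, `recover_best_effort`, `finish_transaction_rpc`, and `outcome` are hypothetical names, the tserver RPC is replaced by a function that always fails, and the separate DDL and DML RPCs are collapsed into one call. It only illustrates why re-entering recovery on every abort failure exhausts a five-entry error stack, while a best-effort abort clears local state once and closes the connection.

```python
ERROR_STACK_SIZE = 5  # stack size quoted in the commit summary


class Panic(Exception):
    """Models the PANIC raised when the error stack overflows."""


class RpcError(Exception):
    """Models a failed FinishTransaction RPC (tserver unreachable)."""


def finish_transaction_rpc(commit):
    # Hypothetical stand-in for the RPC to the local tserver; always fails here.
    raise RpcError("tserver unavailable")


class Backend:
    def __init__(self):
        self.error_stack = []
        self.txn_active = True
        self.connection_open = True

    def push_error(self, err):
        if len(self.error_stack) >= ERROR_STACK_SIZE:
            raise Panic("error stack overflow")
        self.error_stack.append(err)

    def recover_recursive(self, err):
        """Old behaviour: a failed abort re-enters error recovery."""
        self.push_error(err)
        try:
            finish_transaction_rpc(commit=False)
            self.txn_active = False
        except RpcError as abort_err:
            self.recover_recursive(abort_err)

    def recover_best_effort(self, err):
        """New behaviour: abort once, always clear local state, never recurse."""
        self.push_error(err)
        status = None
        try:
            finish_transaction_rpc(commit=False)
        except RpcError as abort_err:
            status = abort_err
        self.txn_active = False  # cleared irrespective of RPC success or failure
        if status is not None:
            self.connection_open = False  # the pg layer closes the connection
        return status


def outcome(recover_name):
    """Run one recovery strategy against an unreachable tserver."""
    backend = Backend()
    try:
        getattr(backend, recover_name)(RpcError("initial failure"))
    except Panic:
        return "PANIC"
    return "ok" if backend.connection_open else "connection closed"
```

Calling `outcome("recover_recursive")` and `outcome("recover_best_effort")` contrasts the two strategies under the same simulated outage.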
Summary: D34725 introduced a best effort approach to abort transactions in order to prevent an error stack overflow in case of repeated failures. This revision extends the same behavior to the abort of subtransactions: if a failure is detected in rolling back to a subtransaction, the backend connection is terminated. This approach is preferred to handling and propagating the error further because of its simplicity. This is helpful from an end user's perspective, as the previous approach produced a core-dump (as a result of a PANIC from stack overflow) which raised a system alert and engaged Support teams for what is an innocuous error. This revision changes the core-dump to a FATAL log message. Jira: DB-7215 Test Plan: Manual testing. Unit tests to follow in a separate revision. Reviewers: pjain Reviewed By: pjain Subscribers: smishra, yql Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D35757
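A minimal Python sketch of the subtransaction behaviour described in this summary, under stated assumptions: `Session`, `rollback_to`, `healthy_tserver`, and `unreachable_tserver` are hypothetical names, and a plain `ConnectionError` stands in for a failed tserver call. It is meant only to show the design choice of ending the session with a FATAL message on a failed rollback instead of propagating the error into further recovery.

```python
class Session:
    """Toy model of a backend session with a stack of savepoints."""

    def __init__(self, rollback_rpc):
        self.rollback_rpc = rollback_rpc  # hypothetical stand-in for the tserver call
        self.savepoints = []
        self.log = []
        self.alive = True

    def savepoint(self, name):
        self.savepoints.append(name)

    def rollback_to(self, name):
        index = self.savepoints.index(name)
        try:
            self.rollback_rpc(name)
        except ConnectionError as err:
            # No retry and no nested error recovery: log FATAL and end the session.
            self.log.append(f"FATAL: rollback to {name} failed: {err}")
            self.alive = False
            return False
        # Success: discard savepoints created after the target one.
        del self.savepoints[index + 1:]
        return True


def unreachable_tserver(name):
    raise ConnectionError("tserver unavailable")


def healthy_tserver(name):
    return None
```

With `healthy_tserver` the rollback trims the savepoint stack and the session stays alive; with `unreachable_tserver` the session is marked dead after a single FATAL log entry.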
…h for Abort Transaction Summary: Original commit: 1884090 / D34725 Jira: DB-7215 Reviewers: pjain Reviewed By: pjain Subscribers: yql, smishra Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D36015
…for Abort Transaction Summary: Original commit: 1884090 / D34725 Jira: DB-7215 Reviewers: pjain Reviewed By: pjain Subscribers: smishra, yql Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D36016
…uster tests

Summary:

This revision introduces the ability to mock tserver responses to pggate RPCs in pg_client_service. The goal is to be able to test hard-to-reproduce failure modes between pggate and the tserver deterministically by adding mocks.

As an example, it is now possible to emulate scenarios such as "Introduce network failure for FinishTransaction RPCs in Session X after successful completion of CreateTable RPC", which would previously have required tinkering with a lot of gflags and concurrency constructs.

All RPCs in `src/yb/tserver/pg_client.proto` are now mockable.

Jira: DB-7215

Test Plan: Run the following sample test:

```
./yb_build.sh --cxx-test pgwrapper_pg_mini-test --gtest-filter 'PgRecursiveAbortTest.AbortAfterTserverShutdown'
```

Reviewers: dmitry, pjain
Reviewed By: dmitry
Subscribers: ybase, pjain, smishra, yql
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D34698
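The example scenario quoted in this summary can be illustrated with a small Python model. This is a sketch of the general idea only and does not reflect the actual C++ interface added to pg_client_service: `MockablePgClientService`, `mock`, `call`, `NetworkError`, and `run_scenario` are all hypothetical names. It shows a per-session, per-RPC mock being installed after a CreateTable call has succeeded, so that only the chosen session's FinishTransaction calls fail.

```python
class NetworkError(Exception):
    """Models an injected network failure between pggate and the tserver."""


class MockablePgClientService:
    """Toy service whose responses can be overridden per (session, RPC)."""

    def __init__(self):
        self._mocks = {}  # (session_id, rpc_name) -> callable producing the response
        self.calls = []   # record of every call, useful for assertions in tests

    def mock(self, session_id, rpc_name, handler):
        self._mocks[(session_id, rpc_name)] = handler

    def call(self, session_id, rpc_name):
        self.calls.append((session_id, rpc_name))
        handler = self._mocks.get((session_id, rpc_name))
        if handler is not None:
            return handler()  # mocked response (may raise)
        return "OK"           # default: the real handler would run here


def fail_with_network_error():
    raise NetworkError("injected network failure")


def run_scenario():
    service = MockablePgClientService()
    results = [service.call(1, "CreateTable")]  # succeeds before any mock exists
    # Only now start failing FinishTransaction, and only for session 1.
    service.mock(1, "FinishTransaction", fail_with_network_error)
    for session_id in (1, 2):
        try:
            results.append(service.call(session_id, "FinishTransaction"))
        except NetworkError:
            results.append("NetworkError")
    return results
```

Keying mocks on both the session and the RPC name is what makes the failure deterministic in this model: no timing tricks are needed to hit the FinishTransaction call of one particular session.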
Summary:
f8e73e9 [#18192] YSQL: Introduce interface to mock tserver response in MiniCluster tests
4ae68f4 Build break fix for centos7

Excluded:
2ec9224 [#23033] Allow running YSQL upgrade unit tests with snapshot other than 2.0.9.0
37912f1 [#22058] docdb: Disable connections on cloned db until cloning is complete
059b855 [#22908] xCluster: Use XClusterRemoteClient across XCluster
5dc5ee7 [#22849] YSQL: Correctly handle reset phase timeout errors in YSQL Connection Manager
af49a1e [#22876][#22835][#22773] CDCSDK: Add new auto flag to identify non-eligible tables in CDC stream
f3c4e14 [PLAT-14524] Up-version pekko to fix TLSActor infinite loop
9388aea [#23052] yugabyted: Restarting a node fails when data_dir is missing in user specified configuration.
5cf9736 [PLAT-12685]: Generate a YBA metric for xcluster config table status.
73fc90a [PLAT-14497]: Fix incremental backup time when none full backup exists
e9b5ba5 [PLAT-14533]: Modify the gflags metadata support db version check
8dca952 [PLAT-14432][Platform] Show certificate Database Node Certificate/key and Client Certificate/key for CA certs in certificate details modal
6551e45 Add utkarsh.munjal to contributors.md
bafa1cb [#21751] YSQL, ASH: Sampling of wait events

Test Plan: Jenkins: rebase: pg15-cherrypicks
Reviewers: jason, tfoucher
Subscribers: yql
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D36325
…esponse in MiniCluster tests Summary: Original commit: f8e73e9 / D34698 Jira: DB-7215 Reviewers: dmitry, pjain Reviewed By: dmitry Subscribers: yql, smishra, pjain, ybase Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D36278
…ponse in MiniCluster tests Summary: Original commit: f8e73e9 / D34698 Jira: DB-7215 Reviewers: dmitry, pjain Reviewed By: dmitry Subscribers: yql, smishra, pjain, ybase Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D36283
… failure. Summary: Original commit: 1d07a89 / D35757 Jira: DB-7215 Reviewers: pjain Reviewed By: pjain Subscribers: yql, smishra Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D36785
…ailure. Summary: Original commit: 1d07a89 / D35757 Jira: DB-7215 Reviewers: pjain Reviewed By: pjain Subscribers: smishra, yql Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D36786
Current status:
Jira Link: DB-7215
Description
We have observed these panics when a transaction is active in the following scenarios:
This issue can be reproduced by cancelling a DDL transaction with pg_terminate_backend() before it is completed.
In addition, this appears to occur in normal transactions when AbortSubtransaction is called and the PostgreSQL process is not able to communicate with the tserver.
Original test which reproduced this issue: