Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broken Connection Error in Odyssey when reset phase query takes too long to execute #22849

Closed
yugabyte-ci opened this issue Jun 13, 2024 · 0 comments
Assignees
Labels
area/ecosystem Label for all ecosystem related projects jira-originated kind/new-feature This is a request for a completely new feature priority/highest Highest priority issue

Comments

@yugabyte-ci
Copy link
Contributor

yugabyte-ci commented Jun 13, 2024

Jira Link: DB-11746

@yugabyte-ci yugabyte-ci added area/ecosystem Label for all ecosystem related projects jira-originated kind/new-feature This is a request for a completely new feature priority/highest Highest priority issue labels Jun 13, 2024
rahulb-yb added a commit that referenced this issue Jul 2, 2024
…nnection Manager

Summary:
**Problem Statement:**

While a logical client detaches from its corresponding server connection, Connection Manager fires certain RESET queries before allowing the server connection to return back to the pool of usable connections for other logical clients to use. The upstream code has made it such that if these RESET queries take longer than 1 second to execute and return cumulatively, then odyssey breaks the connection and forwards its own error to the client with the message `Resource temporarily unavailable`. It is to be noted that there were no errors while executing the client's workload, and that this is an error formed by the Connection Manager layer itself.

**Solution/Changes Made:**

Since there were technically no errors due to the client's workload, it makes no sense to forward any error to the logical client - we can instead silently close the server connection when it is found that we hit the reset timeout during the reset phase. Additionally, increase the reset timeout in this diff as the client's workload is already pipelined and is not dependent on the server's reset phase completing. It will help save some cycles internally if we can re-use the same server connection.

We modify a function, `od_backend_query` to return a different status when the reset timeout is hit, which can be used for any "reset" query that is fired at the time of detaching the logical and physical connection. Some examples of this can be seen with the help of a small excerpt from the conn mgr template file, picked up from pool config settings:

```
pool_discard no
pool_rollback yes
```

Each of these values correspond to what should be done to the server connection with respect to "resetting" the context on it as per the user's wish
- `discard` fires a "discard all" query
- `rollback` fires a "rollback" if it is found that the server is in an active transaction before firing reset statements
- By default, we fire a query that contains "reset all" "reset role" and "reset session authorization"
All of the above scenarios use the same function being modified to return a different status depending on the reset type
Jira: DB-11746

Test Plan: Jenkins: enable connection manager, all tests

Reviewers: skumar, asrinivasan

Reviewed By: asrinivasan

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D35564
jasonyb pushed a commit that referenced this issue Jul 2, 2024
Summary:
 f8e73e9 [#18192] YSQL: Introduce interface to mock tserver response in MiniCluster tests
 4ae68f4 Build break fix for centos7
 Excluded: 2ec9224 [#23033] Allow running YSQL upgrade unit tests with snapshot other than 2.0.9.0
 37912f1 [#22058] docdb: Disable connections on cloned db until cloning is complete
 059b855 [#22908] xCluster: Use XClusterRemoteClient across XCluster
 5dc5ee7 [#22849] YSQL: Correctly handle reset phase timeout errors in YSQL Connection Manager
 af49a1e [#22876][#22835][#22773] CDCSDK: Add new auto flag to identify non-eligible tables in CDC stream
 f3c4e14 [PLAT-14524] Up-version pekko to fix TLSActor infinite loop
 9388aea [#23052] yugabyted:  Restarting a node fails when data_dir is missing in user specified configuration.
 5cf9736 [PLAT-12685]: Generate a YBA metric for xcluster config table status.
 73fc90a [PLAT-14497]: Fix incremental backup time when none full backup exists
 e9b5ba5 [PLAT-14533]: Modify the gflags metadata support db version check
 8dca952 [PLAT-14432][Platform] Show certificate Database Node Certificate/key and Client Certificate/key for CA certs in certificate details modal
 6551e45 Add utkarsh.munjal to contributors.md
 bafa1cb [#21751] YSQL, ASH: Sampling of wait events

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, tfoucher

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36325
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ecosystem Label for all ecosystem related projects jira-originated kind/new-feature This is a request for a completely new feature priority/highest Highest priority issue
Projects
None yet
Development

No branches or pull requests

2 participants