Skip to content

Commit

Permalink
[#22849] YSQL: Correctly handle reset phase timeout errors in YSQL Co…
Browse files Browse the repository at this point in the history
…nnection Manager

Summary:
**Problem Statement:**

While a logical client detaches from its corresponding server connection, Connection Manager fires certain RESET queries before allowing the server connection to return back to the pool of usable connections for other logical clients to use. The upstream code has made it such that if these RESET queries take longer than 1 second to execute and return cumulatively, then odyssey breaks the connection and forwards its own error to the client with the message `Resource temporarily unavailable`. It is to be noted that there were no errors while executing the client's workload, and that this is an error formed by the Connection Manager layer itself.

**Solution/Changes Made:**

Since there were technically no errors due to the client's workload, it makes no sense to forward any error to the logical client - we can instead silently close the server connection when it is found that we hit the reset timeout during the reset phase. Additionally, increase the reset timeout in this diff as the client's workload is already pipelined and is not dependent on the server's reset phase completing. It will help save some cycles internally if we can re-use the same server connection.

We modify a function, `od_backend_query` to return a different status when the reset timeout is hit, which can be used for any "reset" query that is fired at the time of detaching the logical and physical connection. Some examples of this can be seen with the help of a small excerpt from the conn mgr template file, picked up from pool config settings:

```
pool_discard no
pool_rollback yes
```

Each of these values correspond to what should be done to the server connection with respect to "resetting" the context on it as per the user's wish
- `discard` fires a "discard all" query
- `rollback` fires a "rollback" if it is found that the server is in an active transaction before firing reset statements
- By default, we fire a query that contains "reset all" "reset role" and "reset session authorization"
All of the above scenarios use the same function being modified to return a different status depending on the reset type
Jira: DB-11746

Test Plan: Jenkins: enable connection manager, all tests

Reviewers: skumar, asrinivasan

Reviewed By: asrinivasan

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D35564
  • Loading branch information
rahulb-yb committed Jul 2, 2024
1 parent 059b855 commit 5dc5ee7
Show file tree
Hide file tree
Showing 4 changed files with 19 additions and 7 deletions.
4 changes: 3 additions & 1 deletion src/odyssey/sources/backend.c
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ void od_backend_close(od_server_t *server)
assert(server->tls == NULL);
server->is_transaction = 0;
server->yb_sticky_connection = false;
server->reset_timeout = false;
server->idle_time = 0;
kiwi_key_init(&server->key);
kiwi_key_init(&server->key_client);
Expand Down Expand Up @@ -736,7 +737,8 @@ int od_backend_ready_wait(od_server_t *server, char *context, int count,
"read error: %s",
od_io_error(&server->io));
}
return -1;
/* return new status if timeout error */
return -2;
}
kiwi_be_type_t type = *(char *)machine_msg_data(msg);
od_debug(&instance->logger, context, server->client, server,
Expand Down
14 changes: 9 additions & 5 deletions src/odyssey/sources/reset.c
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,7 @@ int od_reset(od_server_t *server)
*
* 3. Continue with (1)
*/
int wait_timeout = 1000;
int wait_timeout = 5000;
int wait_try = 0;
int wait_try_cancel = 0;
int wait_cancel_limit = 1;
Expand All @@ -66,7 +66,8 @@ int od_reset(od_server_t *server)
wait_try++;
rc = od_backend_ready_wait(server, "reset", 1,
wait_timeout);
if (rc == -1)
/* can be -1 or -2 */
if (rc < 0)
break;
}
if (rc == -1) {
Expand Down Expand Up @@ -113,7 +114,7 @@ int od_reset(od_server_t *server)
query_rlb, NULL,
sizeof(query_rlb), wait_timeout,
1);
if (rc == -1)
if (rc < 0)
goto error;
assert(!server->is_transaction);
}
Expand All @@ -125,7 +126,7 @@ int od_reset(od_server_t *server)
rc = od_backend_query(server, "reset-discard", query_discard,
NULL, sizeof(query_discard), wait_timeout,
1);
if (rc == NOT_OK_RESPONSE)
if (rc < 0)
goto error;
}

Expand All @@ -136,7 +137,7 @@ int od_reset(od_server_t *server)
rc = od_backend_query(server, "reset-discard-smart",
query_discard, NULL,
sizeof(query_discard), wait_timeout, 1);
if (rc == NOT_OK_RESPONSE)
if (rc < 0)
goto error;
}

Expand All @@ -147,6 +148,9 @@ int od_reset(od_server_t *server)
NULL, sizeof(query_reset), wait_timeout, 1);
if (rc == -1)
goto error;
/* reset timeout */
if (rc == -2)
server->reset_timeout = true;
}

/* ready */
Expand Down
5 changes: 4 additions & 1 deletion src/odyssey/sources/router.c
Original file line number Diff line number Diff line change
Expand Up @@ -725,8 +725,11 @@ void od_router_detach(od_router_t *router, od_client_t *client)
* a. Creating TEMP TABLES.
* b. Use of WITH HOLD CURSORS.
* c. Client connection is a logical or physical replication connection
* d. It took too long to reset state on the server.
*/
if (od_likely(!server->offline) && !server->yb_sticky_connection) {
if (od_likely(!server->offline) &&
!server->yb_sticky_connection &&
!server->reset_timeout) {
od_instance_t *instance = server->global->instance;
if (route->id.physical_rep || route->id.logical_rep) {
od_debug(&instance->logger, "expire-replication", NULL,
Expand Down
3 changes: 3 additions & 0 deletions src/odyssey/sources/server.h
Original file line number Diff line number Diff line change
Expand Up @@ -58,7 +58,9 @@ struct od_server {

od_list_t link;

/* YB */
bool yb_sticky_connection;
bool reset_timeout;
};

static const size_t OD_SERVER_DEFAULT_HASHMAP_SZ = 420;
Expand All @@ -84,6 +86,7 @@ static inline void od_server_init(od_server_t *server, int reserve_prep_stmts)
server->endpoint_selector = 0;
od_stat_state_init(&server->stats_state);
server->yb_sticky_connection = false;
server->reset_timeout = false;

#ifdef USE_SCRAM
od_scram_state_init(&server->scram_state);
Expand Down

0 comments on commit 5dc5ee7

Please sign in to comment.