Cluster update API call may timeout when one of the known hosts is down #4074

Open
karol-kokoszka opened this issue Oct 18, 2024 · 2 comments · May be fixed by #4077
karol-kokoszka commented Oct 18, 2024

The current logic for updating the cluster in Scylla Manager includes validating host connectivity. This process involves the Manager querying one of the known hosts (stored in the ManagerDB) to discover other cluster hosts via the GET /storage_service/host_id API and mapping them to the correct data center using GET /snitch/datacenter.

After discovering live hosts, the Scylla Manager Agent API on each host is pinged to validate whether the Scylla Manager Agent is installed and responsive.

This is necessary because the Scylla Manager server requires the Scylla Manager Agent to be present on every node in order to execute all supported tasks.
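
For illustration, here is a minimal Go sketch of that discovery flow against the Scylla REST API. Only the two endpoint paths come from this issue; the REST port (10000), the helper names, and the assumed response shapes are illustrative guesses, not the Manager's actual implementation.

```go
// Minimal sketch (not the Manager's actual code) of the discovery flow
// described above. The REST port, helper names, and response shapes are
// assumptions; only the endpoint paths come from the issue text.
package discovery

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// getJSON issues a GET against the Scylla REST API of a single host and
// decodes the JSON response into out.
func getJSON(ctx context.Context, client *http.Client, host, path string, out any) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		fmt.Sprintf("http://%s:10000%s", host, path), nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return json.NewDecoder(resp.Body).Decode(out)
}

// DiscoverHosts asks one known host for the cluster's host-ID map
// (GET /storage_service/host_id) and for its data center
// (GET /snitch/datacenter), mirroring the two calls named in the issue.
func DiscoverHosts(ctx context.Context, knownHost string) (hosts []string, dc string, err error) {
	client := &http.Client{Timeout: 30 * time.Second}

	// Assumed response shape: a list of {key: node address, value: host ID}.
	var entries []struct {
		Key   string `json:"key"`
		Value string `json:"value"`
	}
	if err := getJSON(ctx, client, knownHost, "/storage_service/host_id", &entries); err != nil {
		return nil, "", err
	}
	for _, e := range entries {
		hosts = append(hosts, e.Key)
	}

	// Assumed response shape: a bare JSON string with the data center name.
	if err := getJSON(ctx, client, knownHost, "/snitch/datacenter", &dc); err != nil {
		return nil, "", err
	}
	return hosts, dc, nil
}
```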

Unfortunately, discovering all the nodes in the cluster involves iterating sequentially over the nodes saved in the database: the Manager keeps trying until it receives a response from the Scylla Manager Agent on one of them. Each attempt has a default timeout of 30 seconds plus a few retries, and the Manager only moves on to the next node in the queue after the current node fails to respond.

If the first node in the list is down or unavailable, it blocks the cluster update, forcing it to wait until the server proceeds to a valid node.

The default Manager client has the following timeout configuration: Scylla Manager Client Timeout.

This means that even if the cluster update eventually succeeds, the API call to the Manager server might time out. As a result, the Scylla Manager CLI, which is a wrapper over the Manager server API, will also report an error when updating the cluster.

The Manager server should call all the currently known hosts in parallel and proceed with the first one that responds, canceling calls to the other nodes once a valid response is received.
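
A minimal Go sketch of that behaviour, assuming a hypothetical pingAgent callback that pings the Scylla Manager Agent on a single host (this is not the Manager's actual code):

```go
package discovery

import (
	"context"
	"fmt"
)

// firstRespondingHost probes all known hosts concurrently; the first success
// wins and the shared context is cancelled to stop the remaining probes.
// pingAgent is a hypothetical callback that pings the Agent on one host.
func firstRespondingHost(ctx context.Context, hosts []string,
	pingAgent func(ctx context.Context, host string) error) (string, error) {

	if len(hosts) == 0 {
		return "", fmt.Errorf("no known hosts to probe")
	}

	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // stop outstanding probes once a winner is found

	type result struct {
		host string
		err  error
	}
	results := make(chan result, len(hosts)) // buffered so losing probes never block

	for _, h := range hosts {
		go func(h string) {
			results <- result{host: h, err: pingAgent(ctx, h)}
		}(h)
	}

	var lastErr error
	for range hosts {
		r := <-results
		if r.err == nil {
			return r.host, nil // fastest healthy host wins
		}
		lastErr = r.err
	}
	return "", fmt.Errorf("no known host responded: %w", lastErr)
}
```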

@karol-kokoszka karol-kokoszka added bug Something isn't working high labels Oct 18, 2024
@karol-kokoszka karol-kokoszka added this to the 3.4 milestone Oct 18, 2024
karol-kokoszka commented Oct 18, 2024

Moreover, the cluster update is chained with forcing the config-cache to update its data about this cluster.
This must be an async operation.

return s.notifyChangeListener(ctx, changeEvent)
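
For illustration, a hedged Go sketch of what decoupling that call could look like; the Service and ChangeEvent types, the log call, and the 30-second bound are assumptions made for the example, not the Manager's real types:

```go
// Illustrative sketch only (not the actual Manager code): run the change
// notification, which forces the config-cache refresh, in the background so
// the cluster-update API call does not block on it.
package cluster

import (
	"context"
	"log"
	"time"
)

// ChangeEvent and Service stand in for the Manager's real types; only the
// pieces this sketch needs are shown.
type ChangeEvent struct{ ClusterID string }

type Service struct {
	notifyChangeListener func(ctx context.Context, e ChangeEvent) error
}

// updateCluster returns to the caller without waiting for the config-cache
// refresh; the listener is notified asynchronously with its own timeout.
func (s *Service) updateCluster(e ChangeEvent) error {
	go func() {
		// Detach from the request context so the refresh is not cancelled
		// when the HTTP request that triggered the update returns.
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()
		if err := s.notifyChangeListener(ctx, e); err != nil {
			log.Printf("notify change listener: %v", err)
		}
	}()
	return nil
}
```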

karol-kokoszka commented
We should consider still keeping the coordinator host as the first one to check, but there is definitely no need to wait 30 seconds for the response and retry with backoff a couple of times.
The timeout for the first call can be lowered to 5 seconds, without retries.
If that first call fails, all nodes should be checked in parallel and the first response should cancel the context of the remaining calls (see the sketch below).
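
A short Go sketch of that ordering, reusing the hypothetical pingAgent and firstRespondingHost helpers from the earlier sketch; the 5-second bound mirrors the suggestion above, everything else is an assumption:

```go
import (
	"context"
	"time"
)

// pickCoordinator tries the stored coordinator host once with a short 5s
// timeout and no retries, then falls back to probing every known host in
// parallel via firstRespondingHost (defined in the earlier sketch).
func pickCoordinator(ctx context.Context, coordinator string, hosts []string,
	pingAgent func(ctx context.Context, host string) error) (string, error) {

	shortCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	if err := pingAgent(shortCtx, coordinator); err == nil {
		return coordinator, nil
	}
	// Coordinator is down or slow: probe all known hosts concurrently and
	// take the first one that responds.
	return firstRespondingHost(ctx, hosts, pingAgent)
}
```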

@karol-kokoszka karol-kokoszka self-assigned this Oct 22, 2024
karol-kokoszka added a commit that referenced this issue Oct 22, 2024
…date async

Fixes #4074

This PR makes cluster host discovery more robust: when the cluster.host is DOWN,
it probes all other hosts in parallel and returns the response from the fastest one.
Additionally, this PR makes the call to the cluster config cache asynchronous when updating the cluster.
@karol-kokoszka karol-kokoszka modified the milestones: 3.4, 3.4.1 Oct 28, 2024