Cluster update API call may timeout when one of the known hosts is down #4074

Open
karol-kokoszka opened this issue Oct 18, 2024 · 2 comments · May be fixed by #4077
karol-kokoszka commented Oct 18, 2024

The current logic for updating the cluster in Scylla Manager includes validating host connectivity. This process involves the Manager querying one of the known hosts (stored in the ManagerDB) to discover other cluster hosts via the GET /storage_service/host_id API and mapping them to the correct data center using GET /snitch/datacenter.

After discovering live hosts, the Scylla Manager Agent API on each host is pinged to validate whether the Scylla Manager Agent is installed and responsive.

This is necessary because the Scylla Manager server requires the Scylla Manager Agent to be present on every node in order to execute all supported tasks.
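
For illustration, here is a minimal Go sketch of that discovery flow against the Scylla REST API. Only the two endpoint paths come from this issue; the REST port (10000), the helper names, and the assumed response shapes are illustrative guesses, not the Manager's actual implementation.

```go
// Minimal sketch (not the Manager's actual code) of the discovery flow
// described above. The REST port, helper names, and response shapes are
// assumptions; only the endpoint paths come from the issue text.
package discovery

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// getJSON issues a GET against the Scylla REST API of a single host and
// decodes the JSON response into out.
func getJSON(ctx context.Context, client *http.Client, host, path string, out any) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		fmt.Sprintf("http://%s:10000%s", host, path), nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return json.NewDecoder(resp.Body).Decode(out)
}

// DiscoverHosts asks one known host for the cluster's host-ID map
// (GET /storage_service/host_id) and for its data center
// (GET /snitch/datacenter), mirroring the two calls named in the issue.
func DiscoverHosts(ctx context.Context, knownHost string) (hosts []string, dc string, err error) {
	client := &http.Client{Timeout: 30 * time.Second}

	// Assumed response shape: a list of {key: node address, value: host ID}.
	var entries []struct {
		Key   string `json:"key"`
		Value string `json:"value"`
	}
	if err := getJSON(ctx, client, knownHost, "/storage_service/host_id", &entries); err != nil {
		return nil, "", err
	}
	for _, e := range entries {
		hosts = append(hosts, e.Key)
	}

	// Assumed response shape: a bare JSON string with the data center name.
	if err := getJSON(ctx, client, knownHost, "/snitch/datacenter", &dc); err != nil {
		return nil, "", err
	}
	return hosts, dc, nil
}
```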

Unfortunately, discovering all the nodes in the cluster involves iterating sequentially over the nodes saved in the database: the Manager keeps trying until it receives a response from the Scylla Manager Agent on one of them. Each attempt has a default timeout of 30 seconds plus a few retries, and the Manager only moves on to the next node in the queue after the current node fails to respond.

If the first node in the list is down or unavailable, it blocks the cluster update, forcing it to wait until the server proceeds to a valid node.

The default Manager client has the following timeout configuration: Scylla Manager Client Timeout.

This means that even if the cluster update eventually succeeds, the API call to the Manager server might time out. As a result, the Scylla Manager CLI, which is a wrapper over the Manager server API, will also report an error when updating the cluster.

The Manager server should call all the currently known hosts in parallel and proceed with the first one that responds, canceling calls to the other nodes once a valid response is received.
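
A minimal Go sketch of that behaviour, assuming a hypothetical pingAgent callback that pings the Scylla Manager Agent on a single host (this is not the Manager's actual code):

```go
package discovery

import (
	"context"
	"fmt"
)

// firstRespondingHost probes all known hosts concurrently; the first success
// wins and the shared context is cancelled to stop the remaining probes.
// pingAgent is a hypothetical callback that pings the Agent on one host.
func firstRespondingHost(ctx context.Context, hosts []string,
	pingAgent func(ctx context.Context, host string) error) (string, error) {

	if len(hosts) == 0 {
		return "", fmt.Errorf("no known hosts to probe")
	}

	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // stop outstanding probes once a winner is found

	type result struct {
		host string
		err  error
	}
	results := make(chan result, len(hosts)) // buffered so losing probes never block

	for _, h := range hosts {
		go func(h string) {
			results <- result{host: h, err: pingAgent(ctx, h)}
		}(h)
	}

	var lastErr error
	for range hosts {
		r := <-results
		if r.err == nil {
			return r.host, nil // fastest healthy host wins
		}
		lastErr = r.err
	}
	return "", fmt.Errorf("no known host responded: %w", lastErr)
}
```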

@karol-kokoszka karol-kokoszka added bug Something isn't working high labels Oct 18, 2024
@karol-kokoszka karol-kokoszka added this to the 3.4 milestone Oct 18, 2024
karol-kokoszka commented Oct 18, 2024

Moreover, the cluster update is chained with forcing the config-cache to update its data about this cluster.
This must be an async operation.

return s.notifyChangeListener(ctx, changeEvent)
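
For illustration, a hedged Go sketch of what decoupling that call could look like; the Service and ChangeEvent types, the log call, and the 30-second bound are assumptions made for the example, not the Manager's real types:

```go
// Illustrative sketch only (not the actual Manager code): run the change
// notification, which forces the config-cache refresh, in the background so
// the cluster-update API call does not block on it.
package cluster

import (
	"context"
	"log"
	"time"
)

// ChangeEvent and Service stand in for the Manager's real types; only the
// pieces this sketch needs are shown.
type ChangeEvent struct{ ClusterID string }

type Service struct {
	notifyChangeListener func(ctx context.Context, e ChangeEvent) error
}

// updateCluster returns to the caller without waiting for the config-cache
// refresh; the listener is notified asynchronously with its own timeout.
func (s *Service) updateCluster(e ChangeEvent) error {
	go func() {
		// Detach from the request context so the refresh is not cancelled
		// when the HTTP request that triggered the update returns.
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()
		if err := s.notifyChangeListener(ctx, e); err != nil {
			log.Printf("notify change listener: %v", err)
		}
	}()
	return nil
}
```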

karol-kokoszka commented
We should consider still keeping the coordinator host as the first one to check, but there is definitely no need to wait 30 seconds for the response and retry with backoff a couple of times.
The timeout for the first call can be lowered to 5 seconds, without retries.
If that first call fails, all nodes should be checked in parallel and the first response should cancel the context of the remaining calls (see the sketch below).
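
A short Go sketch of that ordering, reusing the hypothetical pingAgent and firstRespondingHost helpers from the earlier sketch; the 5-second bound mirrors the suggestion above, everything else is an assumption:

```go
import (
	"context"
	"time"
)

// pickCoordinator tries the stored coordinator host once with a short 5s
// timeout and no retries, then falls back to probing every known host in
// parallel via firstRespondingHost (defined in the earlier sketch).
func pickCoordinator(ctx context.Context, coordinator string, hosts []string,
	pingAgent func(ctx context.Context, host string) error) (string, error) {

	shortCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	if err := pingAgent(shortCtx, coordinator); err == nil {
		return coordinator, nil
	}
	// Coordinator is down or slow: probe all known hosts concurrently and
	// take the first one that responds.
	return firstRespondingHost(ctx, hosts, pingAgent)
}
```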

@karol-kokoszka karol-kokoszka self-assigned this Oct 22, 2024
karol-kokoszka added a commit that referenced this issue Oct 22, 2024
…date async

Fixes #4074

This PR makes cluster host discovery more robust: when the cluster.host is DOWN,
it probes all other hosts in parallel and returns the response from the fastest one.
Additionally, this PR makes the call to the cluster config cache asynchronous when updating the cluster.
@karol-kokoszka karol-kokoszka modified the milestones: 3.4, 3.4.1 Oct 28, 2024