The current logic for updating the cluster in Scylla Manager includes validating host connectivity. This process involves the Manager querying one of the known hosts (stored in the ManagerDB) to discover other cluster hosts via the `GET /storage_service/host_id` API and mapping them to the correct data center using `GET /snitch/datacenter`.
After discovering live hosts, the Scylla Manager Agent API on each host is pinged to validate whether the Scylla Manager Agent is installed and responsive.
This is necessary because the Scylla Manager server requires the Scylla Manager Agent to be present on every node in order to execute all supported tasks.
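For illustration, a minimal Go sketch of that discovery flow might look like the following. The REST port (10000 is Scylla's default API port) and the `?host=` query parameter are assumptions here, not necessarily how the Manager implements it internally:

```go
import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// hostIDEntry mirrors one element of the GET /storage_service/host_id
// response, which maps node addresses to host IDs.
type hostIDEntry struct {
	Key   string `json:"key"`   // node address
	Value string `json:"value"` // host ID
}

// discoverDCs asks one known host for the cluster topology, then resolves
// each node's data center through the same host. Response shapes and the
// ?host= parameter are assumptions based on the Scylla REST API.
func discoverDCs(ctx context.Context, known string) (map[string]string, error) {
	var entries []hostIDEntry
	if err := getJSON(ctx, fmt.Sprintf("http://%s:10000/storage_service/host_id", known), &entries); err != nil {
		return nil, err
	}

	hostToDC := make(map[string]string, len(entries))
	for _, e := range entries {
		var dc string
		url := fmt.Sprintf("http://%s:10000/snitch/datacenter?host=%s", known, e.Key)
		if err := getJSON(ctx, url, &dc); err != nil {
			return nil, err
		}
		hostToDC[e.Key] = dc
	}
	return hostToDC, nil
}

// getJSON issues a GET request and decodes the JSON body into out.
func getJSON(ctx context.Context, url string, out any) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return json.NewDecoder(resp.Body).Decode(out)
}
```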
Unfortunately, the logic for discovering all the nodes in the cluster iterates sequentially over the nodes saved in the database. The Manager keeps trying until it receives a response from the Scylla Manager Agent on one of them; the default timeout for each attempt is 30 seconds, with a few retries, and the Manager only moves on to the next node in the queue after the current one fails to respond.
If the first node in the list is down or unavailable, it blocks the cluster update until the server finally proceeds to a reachable node.
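In code terms, the current behavior amounts to a sequential loop along these lines (a simplified sketch; `probeAgent` is a hypothetical stand-in for the real agent ping):

```go
import (
	"context"
	"errors"
	"time"
)

// sequentialProbe mirrors the behavior described above: hosts are tried one
// at a time, and each dead host can burn the full 30 s budget (plus retries)
// before the loop advances to the next one.
func sequentialProbe(ctx context.Context, hosts []string,
	probeAgent func(context.Context, string) error) (string, error) {
	for _, h := range hosts {
		// Per-host budget matching the 30 s default from the issue.
		hctx, cancel := context.WithTimeout(ctx, 30*time.Second)
		err := probeAgent(hctx, h)
		cancel()
		if err == nil {
			return h, nil
		}
		// On failure, simply fall through to the next host in the queue.
	}
	return "", errors.New("no reachable Scylla Manager Agent")
}
```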
This means that even if the cluster update eventually succeeds, the API call to the Manager server might time out. As a result, the Scylla Manager CLI, which is a wrapper over the Manager server API, will also report an error when updating the cluster.
The Manager server should call all the currently known hosts in parallel and proceed with the first one that responds, canceling calls to the other nodes once a valid response is received.
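A minimal Go sketch of that first-responder-wins pattern, again assuming a hypothetical `probeAgent` callback that pings the agent on a single host:

```go
import (
	"context"
	"fmt"
)

// probeAll pings every known host in parallel and returns the first host
// whose agent answers; canceling the shared context tells the remaining
// probes to stop as soon as a winner is found.
func probeAll(ctx context.Context, hosts []string,
	probeAgent func(context.Context, string) error) (string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // stops the losing probes once we return

	type result struct {
		host string
		err  error
	}
	// Buffered so late goroutines can finish even after we have returned.
	results := make(chan result, len(hosts))

	for _, h := range hosts {
		go func(h string) {
			results <- result{host: h, err: probeAgent(ctx, h)}
		}(h)
	}

	var lastErr error
	for range hosts {
		r := <-results
		if r.err == nil {
			return r.host, nil // first success wins
		}
		lastErr = r.err
	}
	return "", fmt.Errorf("no Scylla Manager Agent responded: %w", lastErr)
}
```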
We should consider still keeping the coordinator host as the first one to check, but there is definitely no need to wait 30 s for the response and retry with backoff a couple of times.
The timeout for the first call can be lowered to 5 s, without retries.
Later on, if the first call fails, all nodes should be checked in parallel, and the first response should cancel the context of the remaining calls.
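Put together, the suggested flow could look roughly like this (building on the `probeAll` sketch above; the 5 s budget is the one proposed in this comment):

```go
import (
	"context"
	"time"
)

// pickHost implements the two-phase idea: give the coordinator a short,
// retry-free window first, and only fan out to all known hosts in parallel
// when that fails.
func pickHost(ctx context.Context, coordinator string, hosts []string,
	probeAgent func(context.Context, string) error) (string, error) {
	// Phase 1: 5 s, no retries, coordinator only.
	cctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	err := probeAgent(cctx, coordinator)
	cancel()
	if err == nil {
		return coordinator, nil
	}

	// Phase 2: the coordinator is unreachable, so race all known hosts.
	return probeAll(ctx, hosts, probeAgent)
}
```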
…date async
Fixes #4074
This PR makes the cluster hosts discovery more robust: when the cluster.host is DOWN, it probes all other hosts in parallel and returns the response from the fastest one.
Additionally, this PR makes the call to the cluster config cache asynchronous when updating a cluster.
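The async part could be as simple as moving the cache refresh off the request path. A hedged sketch with hypothetical `store`/`refreshCache` callbacks, not the actual PR code:

```go
import (
	"context"
	"log"
	"time"
)

// updateClusterAsync stores the cluster synchronously but refreshes the
// config cache in the background, so a slow refresh no longer delays the
// API response.
func updateClusterAsync(ctx context.Context,
	store, refreshCache func(context.Context) error) error {
	if err := store(ctx); err != nil {
		return err
	}
	go func() {
		// Detach from the request context so the refresh can outlive it.
		bctx, cancel := context.WithTimeout(context.Background(), time.Minute)
		defer cancel()
		if err := refreshCache(bctx); err != nil {
			log.Printf("config cache refresh failed: %v", err)
		}
	}()
	return nil
}
```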
For reference, the default Manager client uses the following timeout configuration: Scylla Manager Client Timeout.