
Request is sent to inactive node #195

Open
CEmocca opened this issue Jan 7, 2025 · 1 comment

Comments


CEmocca commented Jan 7, 2025

Hi,

We have 6 ScyllaDB nodes in the cluster. Recently, we pulled one of the servers out for maintenance. We expected requests to be sent to the 5 active nodes, skipping the inactive one under the RoundRobin policy. However, requests are still being sent to the inactive node.

Let's assume that we have

  • node_1 - active
  • node_2 - active
  • node_3 - active
  • node_4 - inactive
  • node_5 - active
  • node_6 - active

In the configuration, we set node_1, node_2, node_3, node_5, and node_6 as contact points.
However, during initialization the driver queries all peers via SELECT * FROM system.peers. The inactive node is returned and initialized with the Up status.

Then, the round-robin policy keeps trying to send requests to this node for some period (randomly around 10-120 minutes). All requests to it failed until, after x minutes, the problem went away. (I assume a status-changed event was fired.)
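To illustrate the behavior: a round robin that cycles over all known nodes by index, without consulting their status, will keep routing a share of requests to node_4 as long as it is marked Up. This is a minimal standalone sketch with hypothetical types, not the actual cdrs-tokio load-balancing code:

```rust
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status { Up, Down }

struct Node { name: &'static str, status: Status }

struct RoundRobin { next: usize }

impl RoundRobin {
    // Picks the next node purely by index, without checking its status.
    fn pick<'a>(&mut self, nodes: &'a [Node]) -> &'a Node {
        let node = &nodes[self.next % nodes.len()];
        self.next += 1;
        node
    }
}

// The six-node cluster from the report; node_4 is actually down,
// but the driver initialized it as Up.
fn cluster() -> Vec<Node> {
    vec![
        Node { name: "node_1", status: Status::Up },
        Node { name: "node_2", status: Status::Up },
        Node { name: "node_3", status: Status::Up },
        Node { name: "node_4", status: Status::Up }, // really inactive
        Node { name: "node_5", status: Status::Up },
        Node { name: "node_6", status: Status::Up },
    ]
}

fn main() {
    let nodes = cluster();
    let mut rr = RoundRobin { next: 0 };
    let picks: Vec<&'static str> = (0..12).map(|_| rr.pick(&nodes).name).collect();
    // Every 6th request lands on the inactive node_4 until a status event corrects it.
    let node_4_hits = picks.iter().filter(|n| **n == "node_4").count();
    println!("node_4 received {} of {} requests", node_4_hits, picks.len());
}
```

Until the driver learns the node is down, roughly 1/6 of requests fail, which matches the intermittent errors seen for 10-120 minutes.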

Debug log

2025-01-07T14:10:59.267467Z DEBUG ThreadId(03) cdrs_tokio::cluster::control_connection: 95: Establishing new control connection...
2025-01-07T14:10:59.267705Z DEBUG ThreadId(03) cdrs_tokio::cluster::topology::node: 233: Establishing new connection to node...
2025-01-07T14:10:59.374566Z DEBUG ThreadId(11) cdrs_tokio::cluster::control_connection: 121: Established new control connection.
2025-01-07T14:10:59.496255Z DEBUG ThreadId(11) cdrs_tokio::cluster::metadata_builder: 27: Copying contact point. node_info=NodeInfo { host_id: 23db827d-4f95-4d67-b642-7409d0076ab9, broadcast_rpc_address: node_1:9042, broadcast_address: None, datacenter: "th", rack: "rack2" }
2025-01-07T14:10:59.496332Z DEBUG ThreadId(11) cdrs_tokio::cluster::metadata_builder: 27: Copying contact point. node_info=NodeInfo { host_id: 290c2a93-9d24-4dab-9f2a-32a7bbeb3804, broadcast_rpc_address: node_2:9042, broadcast_address: None, datacenter: "th", rack: "rack3" }
2025-01-07T14:10:59.496355Z DEBUG ThreadId(11) cdrs_tokio::cluster::metadata_builder: 27: Copying contact point. node_info=NodeInfo { host_id: bea093f2-a031-421e-8e57-4b09d4855d98, broadcast_rpc_address: node_3:9042, broadcast_address: None, datacenter: "th", rack: "rack3" }
2025-01-07T14:10:59.496376Z DEBUG ThreadId(11) cdrs_tokio::cluster::metadata_builder: 30: Adding new node. node_info=NodeInfo { host_id: c82e11de-ac57-4257-848c-6e0bfee14bfd, broadcast_rpc_address: node_4:9042, broadcast_address: None, datacenter: "th", rack: "rack2" }
2025-01-07T14:10:59.496397Z DEBUG ThreadId(11) cdrs_tokio::cluster::metadata_builder: 27: Copying contact point. node_info=NodeInfo { host_id: 022c3be3-c7e0-43c3-869d-5a7df5a1e805, broadcast_rpc_address: node_5:9042, broadcast_address: None, datacenter: "th", rack: "rack1" }
2025-01-07T14:10:59.496455Z DEBUG ThreadId(11) cdrs_tokio::cluster::metadata_builder: 27: Copying contact point. node_info=NodeInfo { host_id: 9ca4d73c-7fb0-443b-908b-ba155dc0e0cc, broadcast_rpc_address: node_6:9042, broadcast_address: None, datacenter: "th", rack: "rack1" }

In my opinion, a newly added node should not start with the Up status.
I have forked the repo and changed the initial status to Unknown:
master...CEmocca:cdrs-tokio:master

This solves the issue in my case, but I'm unsure whether it is the right fix in general. I think the status event listener should handle this case.

What do you think?

krojew (Owner) commented Jan 10, 2025

Nodes with unknown status are ignored by default, so we don't incur the penalty of sending requests to potentially downed ones. They are set to up after a topology event is received, proving they are indeed up. In essence, this mechanism prevents the effect you are seeing from occurring all the time through the lifetime of the client, at the cost of occurring once on startup.
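The mechanism described above can be sketched like this (hypothetical helper names, not the real cdrs-tokio API): the query plan only includes nodes proven Up, and an Unknown node is promoted once a topology/status event confirms it is alive:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum NodeState { Up, Unknown }

struct Node { address: &'static str, state: NodeState }

// Only nodes proven Up are eligible to receive requests;
// Unknown nodes are ignored to avoid hitting potentially down ones.
fn query_plan(nodes: &[Node]) -> Vec<&'static str> {
    nodes
        .iter()
        .filter(|n| n.state == NodeState::Up)
        .map(|n| n.address)
        .collect()
}

// A "node up" status/topology event promotes the node to Up.
fn on_node_up(nodes: &mut [Node], address: &str) {
    if let Some(n) = nodes.iter_mut().find(|n| n.address == address) {
        n.state = NodeState::Up;
    }
}

fn main() {
    let mut nodes = vec![
        Node { address: "node_1:9042", state: NodeState::Up },
        Node { address: "node_4:9042", state: NodeState::Unknown },
    ];
    // The Unknown node is skipped until an event proves it is up.
    println!("before event: {:?}", query_plan(&nodes));
    on_node_up(&mut nodes, "node_4:9042");
    println!("after event:  {:?}", query_plan(&nodes));
}
```

The trade-off stated above falls out of this design: a genuinely up node pays a one-time delay at startup until its event arrives, instead of the client repeatedly sending requests to down nodes throughout its lifetime.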
