"gocql: no hosts available in the pool" cannot recover without restart #915
Comments
After we close a connection due to too many timeouts, when would the connection be re-established? I do see gocql waiting for an event from the control connection before reconnecting. But if no node is perceived as down by the control node, and it never emits a 'node up' event, would gocql try to recreate the connections?
@Zariel gocql closes a connection when it takes too long to respond; when would the connection be re-established? I may have missed some of the code, but it looks like it will only reconnect when it receives a "Node Up" event?
It should periodically try to dial nodes that are down; if you can reproduce, try a build with … Also, if all the hosts are down, the host list should be reset to the initial hosts and they will be attempted.
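For reference, in current gocql versions the periodic re-dial of DOWN hosts is driven by `ClusterConfig.ReconnectInterval` (this field may not have existed when the comment above was written). A minimal sketch, with hypothetical contact points and an assumed default of 60 seconds:

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Hypothetical contact points; adjust for your cluster.
	cluster := gocql.NewCluster("10.0.0.1", "10.0.0.2")
	// How often gocql re-dials hosts it has marked DOWN
	// (60s by default, if I read the defaults correctly).
	cluster.ReconnectInterval = 15 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
}
```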
I think the root cause of the problem here is: …
To fix this, we need to mark the host as DOWN when we close all the connections on that host.
I simulated network connectivity issues using iptables:
After that, upon queries I start to receive this error message: … After deleting the rule (…) …
Running a single node of Cassandra (on a Mac, so excuse the lack of iptables), I am able to query the host as expected while it's running. If I kill -9 the process, gocql receives no host-down notification but discovers the control connection is dead and marks the host down. After restarting Cassandra, the driver reconnects and queries can continue. I repeated this with STOP/CONT and had the same results. Can you rerun your test with …?
I think …
Do you know of a way to reproduce this on OSX, or reliably enough for regression tests?
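One iptables-free way to approximate the failure on OSX (a sketch under stated assumptions, not a confirmed regression test for this bug) is to put a small TCP proxy in front of the node and stop forwarding at runtime. Note this simulates closed/refused connections rather than the silent packet drops an iptables DROP rule produces, so it may not exercise exactly the same code path. The addresses and port 9043 below are hypothetical, and gocql would need `cluster.Port = 9043` plus `cluster.DisableInitialHostLookup = true` so it keeps dialing the proxy rather than the advertised node address.

```go
package main

import (
	"io"
	"log"
	"net"
	"sync/atomic"
)

// proxy forwards local connections to a Cassandra node and can be "broken"
// at runtime to simulate the node becoming unreachable without iptables.
type proxy struct {
	broken atomic.Bool // when true, new client connections are closed immediately
}

func (p *proxy) run(listenAddr, backendAddr string) error {
	ln, err := net.Listen("tcp", listenAddr)
	if err != nil {
		return err
	}
	for {
		client, err := ln.Accept()
		if err != nil {
			return err
		}
		if p.broken.Load() {
			client.Close() // pretend the host is down
			continue
		}
		backend, err := net.Dial("tcp", backendAddr)
		if err != nil {
			client.Close()
			continue
		}
		// Pipe bytes in both directions until either side closes.
		go func() { defer client.Close(); defer backend.Close(); io.Copy(backend, client) }()
		go func() { defer client.Close(); defer backend.Close(); io.Copy(client, backend) }()
	}
}

func main() {
	p := &proxy{}
	// Point gocql at 127.0.0.1:9043 instead of the real node, then flip
	// p.broken to true/false to simulate an outage and a recovery.
	log.Fatal(p.run("127.0.0.1:9043", "127.0.0.1:9042"))
}
```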
When this error occurs I have had to cycle the container, as there is no good way to reconnect after this event. There must be a better way, but this error has caused so many problems.
Seems that setting …
+1. Any fixes for this issue? On a single-node Cassandra this issue is consistently reproducible.
Constantly seeing this issue on a single-node Cassandra cluster.
We have this problem too, and have had it since at least October 2018. I had to put in special code to detect this and cause the application to be recycled. Our apps call gocql under very heavy load (millions of requests an hour). For a long time the recycle occurred once a month or so, but the loads have increased recently and it is now up to every day or even more often. We have 4 instances running, and it is not always all 4 instances at the same time. Out of nowhere, with no preceding errors logged, we suddenly get this in a running instance. The only fix is to restart the instance. Occasionally other instances will shortly run into the problem too and need restarting. If we restart, there is no problem at all connecting to the nodes. The main problem we have is that during the detection and restart period there is a significant slowdown in processing, as the instance is not available.
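For anyone in a similar position, the "detect and recycle" workaround described above can be done at the session level rather than restarting the whole process. A rough sketch, assuming queries surface gocql.ErrNoConnections (the exported error behind this message) and ignoring the locking that real concurrent code would need:

```go
package main

import (
	"errors"
	"log"
	"sync/atomic"
	"time"

	"github.com/gocql/gocql"
)

// sessionHolder recreates the gocql session after repeated ErrNoConnections
// failures. This is a sketch of the "detect and recycle" idea, not the
// poster's actual code, and it is not race-safe under concurrent use.
type sessionHolder struct {
	cluster  *gocql.ClusterConfig
	session  atomic.Pointer[gocql.Session]
	failures atomic.Int64
}

func (h *sessionHolder) exec(stmt string, values ...interface{}) error {
	err := h.session.Load().Query(stmt, values...).Exec()
	switch {
	case err == nil:
		h.failures.Store(0)
	case errors.Is(err, gocql.ErrNoConnections):
		if h.failures.Add(1) >= 10 { // arbitrary threshold
			h.recycle()
		}
	}
	return err
}

func (h *sessionHolder) recycle() {
	log.Println("recycling gocql session after repeated 'no hosts available' errors")
	fresh, err := h.cluster.CreateSession()
	if err != nil {
		log.Printf("failed to recreate session: %v", err)
		return
	}
	h.session.Swap(fresh).Close()
	h.failures.Store(0)
}

func main() {
	cluster := gocql.NewCluster("127.0.0.1") // hypothetical contact point
	cluster.Timeout = 2 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	h := &sessionHolder{cluster: cluster}
	h.session.Store(session)

	if err := h.exec("SELECT now() FROM system.local"); err != nil {
		log.Printf("query failed: %v", err)
	}
}
```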
+ Remove the global singleton by passing around context when possible and by using a closure for the Rend New Cassandra handler creation. The purpose is to avoid recreating a new connection toward Cassandra each time this handler is invoked.
+ Move away from the flaky unlogged batch for SETs. The problem is that managing the size of the batch is tricky, as it depends not on the number of elements but on the byte size of the batch.
+ Instead of relying on a single buffer for the batched SETs, use a fan-out approach where goroutines are responsible for sending single SET commands toward Cassandra. If performance decreases we can re-use the unlogged batch with a much smaller SET buffer, thus avoiding the problem above.
+ Made shutdown of Memendra thread safe.
+ Add a custom ConvictionPolicy to avoid apache/cassandra-gocql-driver#915.
+ Prepare the statement in the Cassandra context to avoid allocating a new string on every request with fmt.format().
+ Replace Bind(...) calls by Query in the new gocql version (simplifies the code).
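For context, a custom ConvictionPolicy like the one mentioned in that commit plugs into gocql via `ClusterConfig.ConvictionPolicy`. A minimal sketch of the general idea (not the exact policy from that commit) is one that never convicts a host, so query failures alone never mark it DOWN:

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

// neverConvictPolicy implements gocql.ConvictionPolicy. It logs failures but
// never convicts a host, so hosts are not marked DOWN because of query errors.
type neverConvictPolicy struct{}

func (neverConvictPolicy) AddFailure(err error, host *gocql.HostInfo) bool {
	log.Printf("ignoring failure on host %s: %v", host.ConnectAddress(), err)
	return false // never mark the host down
}

func (neverConvictPolicy) Reset(host *gocql.HostInfo) {}

func main() {
	cluster := gocql.NewCluster("127.0.0.1") // hypothetical contact point
	cluster.Timeout = 2 * time.Second
	cluster.ConvictionPolicy = neverConvictPolicy{}

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
}
```

The trade-off is that genuinely dead hosts stay in the pool longer, so this papers over the host-marking behaviour rather than fixing it.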
I ran into a similar issue a couple of days ago and followed @balta2ar's instructions to try to recreate the situation. Surprisingly, gocql reconnected after removing the rule (…). Here are the logs: …
Any other ideas on how to reproduce this issue?
Running into this while using gocql to write to AWS Keyspaces.
I'm also experiencing this issue extremely frequently with AWS Keyspaces. It appears to be entirely due to the state of the process: other processes can access Keyspaces without issue, and restarting an affected process immediately resolves it. I'm up for helping to debug however I can.
It seems like it happens because of …
There is …
I'm trying to reproduce this using docker-compose on a Mac. I'm adding `cap_add: [NET_ADMIN]` to my docker-compose.yaml so I can use … When I execute …
So far, so good. Occasionally, an automatic reconnect is apparently attempted:
But when I then remove the rule via …
However, I know that the issue is not entirely fixed, because we had an outage due to it recently. Now I don't know how to reproduce it.
I've used both changes in those PRs and they didn't solve my issue, at least the way I'm testing it.
If the test case I've put together in #915 (comment) (https://github.com/PierreF/gocql-cluster-issue) is the issue @zhixinwen reported, then #1682 fixes this issue. At least after an upgrade to gocql v1.4.0, I no longer have the issue.
Still having the same issue with Azure Cosmos DB Cassandra Interface (private endpoint), even after upgrading to v1.5.2 |
For those who are still experiencing this issue, can you confirm whether you're using …?
I do use …
I'm using …
We are considering deprecation of …
Just updated from …
Hello @jack-at-circle, this is what I use:

```go
cluster.DisableInitialHostLookup = false
cluster.PoolConfig = gocql.PoolConfig{
	HostSelectionPolicy: gocql.SingleHostReadyPolicy(gocql.RoundRobinHostPolicy()),
}
```

Hope it helps.
We're currently using that config, @josuebrunel, but we are still losing connections after a day or two :(
@jack-at-circle Oh I see. I'm using a custom reconnection policy:

```go
type ReconnectionPolicy struct {
	maxRetries int
	delay      time.Duration
}

func (rp ReconnectionPolicy) GetInterval(currentRetry int) time.Duration {
	slog.Info("Trying to reconnect to db instance", "currentRetry", currentRetry)
	return rp.delay
}

func (rp ReconnectionPolicy) GetMaxRetries() int {
	return rp.maxRetries
}

cluster.ReconnectionPolicy = ReconnectionPolicy{maxRetries: 4, delay: time.Duration(400) * time.Millisecond}
```
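For what it's worth, gocql also ships a built-in policy with the same interface, so (continuing from the `cluster` variable in the snippet above, and assuming the same 4 retries at 400ms) the custom type may not be needed:

```go
cluster.ReconnectionPolicy = &gocql.ConstantReconnectionPolicy{
	MaxRetries: 4,
	Interval:   400 * time.Millisecond,
}
```

There is also an ExponentialReconnectionPolicy if a backoff between attempts is preferred.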
I had this problem with v1.6.0 (actually with `replace github.com/gocql/gocql => github.com/scylladb/gocql v1.12.0`). It turned out to be a configuration issue (thankfully only in the dev environment). My nodes have two IP addresses, 100.* (extip) and 10.* (ip). The config used the extip addresses, but system.local returns the ip addresses, and the firewall only allowed the extip addresses. Log: …
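For split-address setups like this, where system.local/system.peers advertise addresses the client cannot reach, gocql's `ClusterConfig.AddressTranslator` can map them before dialing. A minimal sketch; the 10.* to 100.* mapping below is entirely hypothetical:

```go
package main

import (
	"log"
	"net"

	"github.com/gocql/gocql"
)

func main() {
	// Hypothetical mapping from node-advertised internal addresses (10.*)
	// to the externally reachable addresses (100.*) the firewall allows.
	internalToExternal := map[string]string{
		"10.0.0.1": "100.64.0.1",
		"10.0.0.2": "100.64.0.2",
	}

	cluster := gocql.NewCluster("100.64.0.1", "100.64.0.2")
	cluster.AddressTranslator = gocql.AddressTranslatorFunc(func(addr net.IP, port int) (net.IP, int) {
		if ext, ok := internalToExternal[addr.String()]; ok {
			return net.ParseIP(ext), port
		}
		return addr, port
	})

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
}
```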
|
This issue is seven years old as of yesterday 🎂 - quite a few changes have been made to the connection failure recovery logic since it was created. Might I suggest closing this issue and documenting each remaining failure case in its own (new) issue?
We are still facing this issue. Restarting the client fixes it; otherwise, every other day we see the same error. What is the fix for this problem?
@bhuvi-maheshwari This issue has been closed for months. If you've confirmed your configuration is correct and suspect a bug, I'd suggest opening a new issue, including all versions, configuration, and log messages specific to your situation.
As the issue suggests, the problem is that …
@dkropachev Thanks for your reply. Then how do we fix it? Do you know if upgrading the version of gocql will fix it, and if so, to what version? We are currently on v1.6.0.
@bhuvi-maheshwari, to answer that question I need to know your network infrastructure and driver configuration.
@dkropachev Here is the Cassandra setup. We use github.com/gocql/gocql v1.6.0. Go version: 1.21.1. `clusterConfig := gocql.NewCluster(ips...)` …
We had a cluster degrade due to increased traffic load, and on the client side we saw the "gocql: no hosts available in the pool" error for ~80% of the requests.
The error rate was very consistent, and the C* cluster was healthy apart from the load. We tried to reduce the load on the cluster, but the errors remained just as consistent and lasted for two hours.
We restarted the client, and the errors went away immediately.
One other interesting thing is that after we restarted the gocql client, we also saw an immediate drop in coordinator latency and traffic on the C* cluster (yes, the client restarted first, then the latency and traffic on C* dropped, not the other way around). It may be because upstream users stopped retrying after we restarted the server and no longer saw "gocql: no hosts available in the pool", but we are not sure about the root cause yet.