Jaeger fragile in the face of cassandra failures #767

Closed
rbtcollins opened this issue Apr 5, 2018 · 10 comments

Comments

@rbtcollins
Contributor

We've been observing Jaeger components - specifically the collectors and query - fail to recover after Cassandra has any sort of outage. While Cassandra reliability is clearly not a topic for here ;), Jaeger's resilience to issues is.

What we see is a stuck process that logs only one message, even hours after the outage is fixed: gocql: no hosts available in the pool

What we'd like to see happen is a recovery after a few minutes without manual intervention. This is related to, but distinct from, #562

@burmanm
Contributor

burmanm commented Apr 12, 2018

That sounds like a bug in gocql though, not in Jaeger directly.

@yurishkuro
Member

We have a fixit week coming up and want to try upgrading the driver.

@rbtcollins
Contributor Author

@burmanm whether it is a bug in gocql or not doesn't really matter:

  • if it is a gocql bug, then until it is fixed and the updated version is in use, Jaeger will still suffer from the issue, so a tracking bug is relevant
  • if it isn't (or if the gocql maintainers decline to fix it for some reason), then a local workaround is needed to solve the issue.

Either way, a bug in this repo is relevant while the defect is present :).

@jpkrohling
Contributor

jpkrohling commented May 28, 2018

@rbtcollins are you still experiencing this? The driver has been updated as part of #829.

Edit: of course you are still experiencing this... we haven't released a version with the fix yet. Would you be able to run from master?

@rbtcollins
Contributor Author

I'll see about that; I've got a few things up in the air just now. Is anything blocking a release?

@black-adder
Contributor

This was released as part of 1.5.0, let us know if it works

@jpkrohling
Contributor

I'm closing this one, but feel free to reopen if you are still experiencing this after 1.5.0

@nyanshak
Contributor

nyanshak commented Jul 12, 2018

I'd like to re-open this as I'm seeing the same problem after 1.5.0.

Related gocql issue: apache/cassandra-gocql-driver#915

I see there is a gocql setting, ReconnectInterval:

	// If not zero, gocql attempt to reconnect known DOWN nodes in every ReconnectInterval.
	ReconnectInterval time.Duration

In our use case, it's definitely been triggered while resizing the cluster.

Assuming that setting works as advertised... I believe the suggestion would be to add a flag to jaeger for cassandra reconnect interval, and set the default to some reasonable value.
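As a rough illustration of that suggestion, here is a minimal sketch that uses only the standard library. The flag name `cassandra.reconnect-interval`, the 60-second default, and the `Configuration` type are assumptions made for the example, not names taken from Jaeger. The parsed value would then be copied into the `ReconnectInterval` field of gocql's cluster configuration quoted above.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// Hypothetical flag name and default; neither is taken from this thread.
const (
	reconnectIntervalFlag    = "cassandra.reconnect-interval"
	defaultReconnectInterval = 60 * time.Second
)

// Configuration is an illustrative stand-in for the Cassandra storage settings.
type Configuration struct {
	// ReconnectInterval is meant to be assigned to gocql's
	// ClusterConfig.ReconnectInterval when the session is built.
	ReconnectInterval time.Duration
}

// AddFlags registers the reconnect flag on fs and binds it to c.
func (c *Configuration) AddFlags(fs *flag.FlagSet) {
	fs.DurationVar(&c.ReconnectInterval, reconnectIntervalFlag, defaultReconnectInterval,
		"how often to retry Cassandra hosts marked DOWN; 0 leaves periodic reconnection off")
}

// parseConfig builds a Configuration from command-line style arguments.
func parseConfig(args []string) (Configuration, error) {
	var c Configuration
	fs := flag.NewFlagSet("collector", flag.ContinueOnError)
	c.AddFlags(fs)
	err := fs.Parse(args)
	return c, err
}

func main() {
	cfg, err := parseConfig([]string{"--cassandra.reconnect-interval=30s"})
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.ReconnectInterval)
}
```

A non-zero default matters here: per the gocql comment, a zero interval means down nodes are not retried periodically, which matches the stuck behaviour described in this issue.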

@pavolloffay pavolloffay reopened this Jul 13, 2018
@jpkrohling
Contributor

> I believe the suggestion would be to add a flag to jaeger for cassandra reconnect interval, and set the default to some reasonable value.

Would you like to contribute a patch?

nyanshak added a commit to nyanshak/jaeger that referenced this issue Jul 13, 2018
* Fix jaegertracing#767 by enabling gocql setting `ReconnectInterval` to reconnect to
down Cassandra hosts at a regular interval.
nyanshak added a commit to nyanshak/jaeger that referenced this issue Jul 13, 2018
* Fix jaegertracing#767 by enabling gocql setting `ReconnectInterval` to reconnect to
down Cassandra hosts at a regular interval.

Signed-off-by: Brendan Shaklovitz <[email protected]>
@nyanshak
Contributor

@jpkrohling Sure 💯 Opened #934

yurishkuro pushed a commit that referenced this issue Jul 14, 2018
* Make cassandra reconnect down hosts.

* Fix #767 by enabling gocql setting `ReconnectInterval` to reconnect to
down Cassandra hosts at a regular interval.

Signed-off-by: Brendan Shaklovitz <[email protected]>

* Add cassandra `ReconnectInterval` test.

Signed-off-by: Brendan Shaklovitz <[email protected]>