Jaeger fragile in the face of cassandra failures #767

rbtcollins · 2018-04-05T04:49:01Z

We've been observing Jaeger components - specifically the collectors and query - fail to recover after Cassandra has any sort of outage. While Cassandra reliability is clearly not a topic for here ;), Jaeger's resilience to issues is.

What we see is a stuck process that keeps logging a single message, even hours after the issue is fixed: `gocql: no hosts available in the pool`

What we'd like to see is a recovery after a few minutes without manual intervention. This is related to, but distinct from, #562.


burmanm · 2018-04-12T08:49:23Z

That sounds like a bug in gocql, though, not in Jaeger directly.

yurishkuro · 2018-04-12T13:55:43Z

We have a fixit week coming up and want to try upgrading the driver.

rbtcollins · 2018-05-28T08:10:57Z

@burmanm whether it is a bug in gocql or not doesn't really matter:

* If it is a gocql bug, then until it is both fixed and the updated version is in use, Jaeger will still suffer from the issue, so a tracking bug is relevant.
* If it isn't (or if gocql say they won't fix it for some reason), then a local workaround is needed to solve the issue.

Either way, a bug in this repo is relevant while the defect is present :).

jpkrohling · 2018-05-28T08:53:33Z

@rbtcollins are you still experiencing this? The driver has been updated as part of #829.

Edit: of course you are still experiencing this... we haven't released a version with the fix yet. Would you be able to run from master?

rbtcollins · 2018-06-04T00:14:04Z

I'll see about that; I've got a few things up in the air just now. Is anything blocking a release?

black-adder · 2018-06-04T14:43:15Z

This was released as part of 1.5.0; let us know if it works.

jpkrohling · 2018-06-11T10:35:55Z

I'm closing this one, but feel free to reopen if you are still experiencing this after 1.5.0.

nyanshak · 2018-07-12T17:36:03Z

I'd like to re-open this as I'm seeing the same problem after 1.5.0.

Related gocql issue: apache/cassandra-gocql-driver#915

I see there is a `ReconnectInterval` setting in gocql:

	// If not zero, gocql attempt to reconnect known DOWN nodes in every ReconnectInterval.
	ReconnectInterval time.Duration

In our use case, it's definitely been triggered while resizing the cluster.

Assuming that setting works as advertised... I believe the suggestion would be to add a flag to jaeger for cassandra reconnect interval, and set the default to some reasonable value.

@jpkrohling

jpkrohling · 2018-07-13T08:10:07Z

> I believe the suggestion would be to add a flag to jaeger for cassandra reconnect interval, and set the default to some reasonable value.

Would you like to contribute a patch?


nyanshak · 2018-07-13T16:34:01Z

@jpkrohling Sure 💯 Opened #934

* Make cassandra reconnect down hosts.
* Fix #767 by enabling gocql setting `ReconnectInterval` to reconnect to down Cassandra hosts at a regular interval. Signed-off-by: Brendan Shaklovitz <[email protected]>
* Add cassandra `ReconnectInterval` test. Signed-off-by: Brendan Shaklovitz <[email protected]>

jpkrohling closed this as completed Jun 11, 2018

pavolloffay reopened this Jul 13, 2018

nyanshak added a commit to nyanshak/jaeger that referenced this issue Jul 13, 2018

Make cassandra reconnect down hosts

b179246

* Fix jaegertracing#767 by enabling gocql setting `ReconnectInterval` to reconnect to down Cassandra hosts at a regular interval.

nyanshak mentioned this issue Jul 13, 2018

Add support for Cassandra reconnect interval #934

Merged

yurishkuro closed this as completed in #934 Jul 14, 2018
