
BinlogConnectorReplicator: add heartbeat detection [MELINF-2251] #1643

Merged: 6 commits into master on Feb 5, 2021

Conversation

timbertson-zd
Contributor

@timbertson-zd timbertson-zd commented Feb 2, 2021

Since deploying v1.27.x, we've had a small but noticeable uptick in stuck maxwells: no events flowing, but maxwell itself running with no errors and still reporting metrics etc.

My main suspect is #1548. That disables the binlog replicator's dead-connection logic because it clashes with maxwell's own functionality. But maxwell's functionality is pretty weak here: it only knows about active failures (i.e. exceptions); if the connection simply goes quiet, it won't notice.

After reading shyiko/mysql-binlog-connector-java#118, I think maxwell should use binlog-connector's heartbeating. We still don't want to enable BC's keepalive thread (because it conflicts with our reconnect logic), but we can enable heartbeating and then reimplement the heartbeat detection as part of our existing dead-connection detection.
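The idea above can be sketched as a small liveness monitor: every event (heartbeats included) bumps a timestamp, and the periodic dead-connection check compares the event age against the heartbeat interval times an allowance factor. This is a minimal illustrative sketch; the class and method names here are assumptions, not Maxwell's actual API.

```java
// Hypothetical sketch of the proposed liveness check; names are illustrative.
public class BinlogLivenessMonitor {
    private final long heartbeatIntervalMs;   // interval we ask the server to heartbeat at
    private final double allowance;           // slack factor before declaring the link dead
    private volatile long lastEventSeenAt;    // updated on every event, heartbeats included

    public BinlogLivenessMonitor(long heartbeatIntervalMs, double allowance) {
        this.heartbeatIntervalMs = heartbeatIntervalMs;
        this.allowance = allowance;
        this.lastEventSeenAt = System.currentTimeMillis();
    }

    // Called from the event listener for every binlog event (including HEARTBEAT).
    public void onEvent(long nowMs) {
        lastEventSeenAt = nowMs;
    }

    // Called periodically by the existing dead-connection check.
    public boolean isStale(long nowMs) {
        long maxAllowedEventAge = (long) (heartbeatIntervalMs * allowance);
        return (nowMs - lastEventSeenAt) > maxAllowedEventAge;
    }
}
```

Because the server heartbeats even when no real traffic is flowing, a quiet-but-healthy connection keeps bumping the timestamp, while a silently dead one eventually trips `isStale`.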

/cc @zendesk/goanna

Contributor

@lukestephenson-zendesk lukestephenson-zendesk left a comment


Seems ok.

@osheroff
Collaborator

osheroff commented Feb 3, 2021

I like this approach, but do keep in mind that this is theory-based programming: losing the server CX and not noticing is a possible cause but perhaps not the only one; if you can catch a node red-handed please try to get a jstack dump of it to confirm.

Regardless, turning on master heartbeating seems like a pretty good idea anyway. I'll put code-specific comments inline.

@timbertson-zd
Contributor Author

> if you can catch a node red-handed please try to get a jstack dump of it to confirm.

Yeah, we're working on that for next time. Definitely agree that it's not necessarily the cause but it hopefully eliminates one possible cause. And it'll be useful data to see how often this triggers, once we deploy it.

@timbertson-zd
Contributor Author

OK, updated with just a replication_heartbeat boolean to turn it on/off; it's enabled by default.

Review comment on docs/docs/config.md (outdated):
```java
	LOGGER.warn(
		"Last binlog event seen " + lastEventAge + "ms ago, exceeding " + maxAllowedEventAge + "ms allowance " +
		"(" + binlogHeartbeatInterval + " * " + binlogHeartbeatIntervalAllowance + ")");
}
```
Collaborator


have you managed to trigger this codepath (and the ensuing reconnection logic) locally? Like by say, turning on maxwell's expectation of heartbeats but turning off the actual heartbeat mechanism?

I don't necessarily expect you to write an integration test here, probably too hard, but I'd like to know that you at least reproduced a synthetic run of the code.

Contributor Author


Yeah, I'll hack up something locally once we're happy with the PR and make sure it hangs together 👍

@osheroff
Collaborator

osheroff commented Feb 4, 2021

generally looking good.

@timbertson-zd
Contributor Author

I too was wondering if there were any places where we might timeout inelegantly waiting for the connection to come up or something.

Actually, if we're reconnecting instead of dying, there are two edge cases: the initial connection, and then on any subsequent connection we'll still remember the last heartbeat (which is by definition too old 💥).

We really want to reset the lastEventSeenAt in onConnect. So I've broken it out into its own listener in a29cdaa, which can be both a Lifecycle and Event listener. The current logic was split between Replicator and EventListener, which was already a bit awkward.

(and that way "are heartbeats enabled" equates to "is the heartbeat monitor non-null", which feels cleaner)
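The combined-listener shape described above can be sketched roughly as follows. The two interfaces here stand in for binlog-connector's lifecycle and event listener interfaces, and all names are illustrative assumptions, not the actual code from a29cdaa.

```java
// Illustrative sketch of a listener that handles both lifecycle and event
// callbacks; names are assumptions, not Maxwell's real classes.
interface ConnectionLifecycle { void onConnect(long nowMs); void onDisconnect(long nowMs); }
interface BinlogEvents { void onEvent(long nowMs); }

class LivenessListener implements ConnectionLifecycle, BinlogEvents {
    private volatile long lastEventSeenAt = -1;  // -1 until first (re)connect
    private final long maxAllowedEventAgeMs;

    LivenessListener(long maxAllowedEventAgeMs) {
        this.maxAllowedEventAgeMs = maxAllowedEventAgeMs;
    }

    @Override public void onConnect(long nowMs) {
        // Reset on every (re)connect so a stale pre-reconnect timestamp
        // can't immediately re-trigger the dead-connection logic.
        lastEventSeenAt = nowMs;
    }

    @Override public void onDisconnect(long nowMs) { /* nothing to reset */ }

    @Override public void onEvent(long nowMs) {
        lastEventSeenAt = nowMs;
    }

    boolean isAlive(long nowMs) {
        return lastEventSeenAt >= 0 && (nowMs - lastEventSeenAt) <= maxAllowedEventAgeMs;
    }
}
```

Keeping the connect-time reset and the event-time update in one object also makes "heartbeats enabled" equivalent to "this listener is non-null", as noted above.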

@osheroff
Collaborator

osheroff commented Feb 4, 2021

> We really want to reset the lastEventSeenAt in onConnect. So I've broken it out into its own listener in a29cdaa, which can be both a Lifecycle and Event listener.

Great, looks much cleaner. I'm 👍 on this after you tell me you've run a test.

@timbertson-zd
Contributor Author

Tested, it (now 😉 ) works as advertised:

With the setHeartbeatInterval commented out, we get:

16:13:08,932 WARN  BinlogConnectorLivenessMonitor - Last binlog event seen 5027ms ago, exceeding 5000ms allowance (5000 * 1.0)
16:13:08,932 INFO  BinlogConnectorReplicator - Binlog disconnected.
16:13:08,934 WARN  BinlogConnectorReplicator - replicator stopped at position: master.000043:19424584 -- restarting
16:13:08,944 INFO  BinaryLogClient - Connected to mysql.docker:3306 at master.000043/19424584 (sid:6379, cid:281)
16:13:08,944 INFO  BinlogConnectorReplicator - Binlog connected.

(I also adjusted the threshold locally because mysql was too chatty)

I had to add a disconnect() before reconnecting, since the replicator might now still think it's connected and throw the "already connected" exception. The implementation looks idempotent, and it didn't complain when I ran disconnect() a few times in a row, so I think that's safe.
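The "disconnect before reconnect" safeguard amounts to clearing possibly-stale connected state with an idempotent call before connecting again. A minimal sketch of that shape, with entirely hypothetical names (this is not Maxwell's replicator code):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of an idempotent disconnect guarding a reconnect.
class ReconnectingReplicator {
    private final AtomicBoolean connected = new AtomicBoolean(false);

    void connect() {
        // Mirrors the "already connected" exception described above.
        if (!connected.compareAndSet(false, true))
            throw new IllegalStateException("already connected");
    }

    // Idempotent: calling it repeatedly is harmless, matching the
    // observation that running disconnect() several times didn't complain.
    void disconnect() {
        connected.set(false);
    }

    void reconnect() {
        disconnect();  // clear possibly-stale connected state first
        connect();
    }

    boolean isConnected() {
        return connected.get();
    }
}
```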

I also moved the config off replicator_heartbeat into binlog_heartbeat, and it gets used for all 3 mysql configs. It's a replication setting, but it shouldn't need to be configured separately if you're using split server roles.

@osheroff osheroff merged commit 613448d into master Feb 5, 2021
@timbertson-zd
Contributor Author

I know it's been a while, but I finally remembered to check back for some historical data: we've seen this log message 4x over the past month. We also haven't seen any recurrence of the initial symptoms (a stuck instance) since we rolled this out in February, so it definitely seems to help.
