
BinlogConnectorReplicator: add heartbeat detection [MELINF-2251] #1643

Merged: 6 commits into master on Feb 5, 2021

Conversation

timbertson-zd
Contributor

@timbertson-zd timbertson-zd commented Feb 2, 2021

Since deploying v1.27.x, we've had a small but noticeable uptick in stuck maxwells: no events flowing, but maxwell itself running with no errors and still reporting metrics etc.

My main suspect is #1548. That disables the binlog replicator's dead-connection logic because it clashes with maxwell's own functionality. But maxwell's functionality is pretty weak here: it only knows about active failures (i.e. exceptions); if the connection simply goes quiet, it won't notice.

After reading shyiko/mysql-binlog-connector-java#118, I think maxwell should use binlog-connector's heartbeating. We still don't want to enable BC's keepalive thread (because it conflicts with our reconnect logic), but we can enable heartbeating and then reimplement the heartbeat detection as part of our existing dead-connection detection.
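The idea above can be sketched as a small liveness monitor: every event (heartbeats included) bumps a timestamp, and the periodic dead-connection check compares the event age against the heartbeat interval times an allowance factor. This is a minimal illustrative sketch; the class and method names here are assumptions, not Maxwell's actual API.

```java
// Hypothetical sketch of the proposed liveness check; names are illustrative.
public class BinlogLivenessMonitor {
    private final long heartbeatIntervalMs;   // interval we ask the server to heartbeat at
    private final double allowance;           // slack factor before declaring the link dead
    private volatile long lastEventSeenAt;    // updated on every event, heartbeats included

    public BinlogLivenessMonitor(long heartbeatIntervalMs, double allowance) {
        this.heartbeatIntervalMs = heartbeatIntervalMs;
        this.allowance = allowance;
        this.lastEventSeenAt = System.currentTimeMillis();
    }

    // Called from the event listener for every binlog event (including HEARTBEAT).
    public void onEvent(long nowMs) {
        lastEventSeenAt = nowMs;
    }

    // Called periodically by the existing dead-connection check.
    public boolean isStale(long nowMs) {
        long maxAllowedEventAge = (long) (heartbeatIntervalMs * allowance);
        return (nowMs - lastEventSeenAt) > maxAllowedEventAge;
    }
}
```

Because the server heartbeats even when no real traffic is flowing, a quiet-but-healthy connection keeps bumping the timestamp, while a silently dead one eventually trips `isStale`.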

/cc @zendesk/goanna

Contributor

@lukestephenson-zendesk lukestephenson-zendesk left a comment


Seems ok.

@osheroff
Collaborator

osheroff commented Feb 3, 2021

I like this approach, but do keep in mind that this is theory-based programming: losing the server CX and not noticing is a possible cause but perhaps not the only one; if you can catch a node red-handed please try to get a jstack dump of it to confirm.

Regardless, turning on master heartbeating seems like a pretty good idea anyway. I'll put code-specific comments inline.

@timbertson-zd
Contributor Author

> if you can catch a node red-handed please try to get a jstack dump of it to confirm.

Yeah, we're working on that for next time. Definitely agree that it's not necessarily the cause but it hopefully eliminates one possible cause. And it'll be useful data to see how often this triggers, once we deploy it.

@timbertson-zd
Contributor Author

OK, updated with just a replication_heartbeat boolean to turn it on/off; it's enabled by default.

Review comment on docs/docs/config.md (outdated):
```java
	LOGGER.warn(
		"Last binlog event seen " + lastEventAge + "ms ago, exceeding " + maxAllowedEventAge + "ms allowance " +
		"(" + binlogHeartbeatInterval + " * " + binlogHeartbeatIntervalAllowance + ")");
}
```
Collaborator


have you managed to trigger this codepath (and the ensuing reconnection logic) locally? Like by say, turning on maxwell's expectation of heartbeats but turning off the actual heartbeat mechanism?

I don't necessarily expect you to write an integration test here, probably too hard, but I'd like to know that you at least reproduced a synthetic run of the code.

Contributor Author


Yeah, I'll hack up something locally once we're happy with the PR and make sure it hangs together 👍

@osheroff
Collaborator

osheroff commented Feb 4, 2021

generally looking good.

@timbertson-zd
Contributor Author

I too was wondering if there were any places where we might timeout inelegantly waiting for the connection to come up or something.

Actually, if we're reconnecting instead of dying, there are two edge cases: the initial connection, and then on any subsequent connection we'll still remember the last heartbeat (which is by definition too old 💥).

We really want to reset the lastEventSeenAt in onConnect. So I've broken it out into its own listener in a29cdaa, which can be both a Lifecycle and Event listener. The current logic was split between Replicator and EventListener, which was already a bit awkward.

(and that way "are heartbeats enabled" equates to "is the heartbeat monitor non-null", which feels cleaner)
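The combined-listener shape described above can be sketched roughly as follows. The two interfaces here stand in for binlog-connector's lifecycle and event listener interfaces, and all names are illustrative assumptions, not the actual code from a29cdaa.

```java
// Illustrative sketch of a listener that handles both lifecycle and event
// callbacks; names are assumptions, not Maxwell's real classes.
interface ConnectionLifecycle { void onConnect(long nowMs); void onDisconnect(long nowMs); }
interface BinlogEvents { void onEvent(long nowMs); }

class LivenessListener implements ConnectionLifecycle, BinlogEvents {
    private volatile long lastEventSeenAt = -1;  // -1 until first (re)connect
    private final long maxAllowedEventAgeMs;

    LivenessListener(long maxAllowedEventAgeMs) {
        this.maxAllowedEventAgeMs = maxAllowedEventAgeMs;
    }

    @Override public void onConnect(long nowMs) {
        // Reset on every (re)connect so a stale pre-reconnect timestamp
        // can't immediately re-trigger the dead-connection logic.
        lastEventSeenAt = nowMs;
    }

    @Override public void onDisconnect(long nowMs) { /* nothing to reset */ }

    @Override public void onEvent(long nowMs) {
        lastEventSeenAt = nowMs;
    }

    boolean isAlive(long nowMs) {
        return lastEventSeenAt >= 0 && (nowMs - lastEventSeenAt) <= maxAllowedEventAgeMs;
    }
}
```

Keeping the connect-time reset and the event-time update in one object also makes "heartbeats enabled" equivalent to "this listener is non-null", as noted above.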

@osheroff
Collaborator

osheroff commented Feb 4, 2021

> We really want to reset the lastEventSeenAt in onConnect. So I've broken it out into its own listener in a29cdaa, which can be both a Lifecycle and Event listener.

Great, looks much cleaner. I'm 👍 on this after you tell me you've run a test.

@timbertson-zd
Contributor Author

Tested, it (now 😉 ) works as advertised:

With the setHeartbeatInterval commented out, we get:

16:13:08,932 WARN  BinlogConnectorLivenessMonitor - Last binlog event seen 5027ms ago, exceeding 5000ms allowance (5000 * 1.0)
16:13:08,932 INFO  BinlogConnectorReplicator - Binlog disconnected.
16:13:08,934 WARN  BinlogConnectorReplicator - replicator stopped at position: master.000043:19424584 -- restarting
16:13:08,944 INFO  BinaryLogClient - Connected to mysql.docker:3306 at master.000043/19424584 (sid:6379, cid:281)
16:13:08,944 INFO  BinlogConnectorReplicator - Binlog connected.

(I also adjusted the threshold locally because mysql was too chatty)

I had to add a disconnect() before reconnecting, since the replicator might now still think it's connected and throw the "already connected" exception. The implementation looks idempotent, and it didn't complain when I ran disconnect() a few times in a row, so I think that's safe.
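The "disconnect before reconnect" safeguard amounts to clearing possibly-stale connected state with an idempotent call before connecting again. A minimal sketch of that shape, with entirely hypothetical names (this is not Maxwell's replicator code):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of an idempotent disconnect guarding a reconnect.
class ReconnectingReplicator {
    private final AtomicBoolean connected = new AtomicBoolean(false);

    void connect() {
        // Mirrors the "already connected" exception described above.
        if (!connected.compareAndSet(false, true))
            throw new IllegalStateException("already connected");
    }

    // Idempotent: calling it repeatedly is harmless, matching the
    // observation that running disconnect() several times didn't complain.
    void disconnect() {
        connected.set(false);
    }

    void reconnect() {
        disconnect();  // clear possibly-stale connected state first
        connect();
    }

    boolean isConnected() {
        return connected.get();
    }
}
```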

I also moved the config off replicator_heartbeat into binlog_heartbeat, and it gets used for all 3 mysql configs. It's a replication setting, but it shouldn't need to be configured separately if you're using split server roles.

@osheroff osheroff merged commit 613448d into master Feb 5, 2021
@timbertson-zd
Contributor Author

I know it's been a while, but I finally remembered to check back for some historical data: we've seen this log message 4x over the past month. We also haven't seen any recurrence of the initial symptoms (a stuck instance) since we rolled this out in February, so it definitely seems to help.
