BinlogConnectorReplicator: add heartbeat detection [MELINF-2251] #1643
Seems ok.
I like this approach, but do keep in mind that this is theory-based programming: losing the server connection and not noticing is a possible cause, but perhaps not the only one; if you can catch a node red-handed please try to get a Regardless, turning on master heartbeating seems like a pretty good idea anyway. I'll put code-specific comments inline.
src/main/java/com/zendesk/maxwell/replication/BinlogConnectorReplicator.java
Yeah, we're working on that for next time. Definitely agree that it's not necessarily the cause, but it hopefully eliminates one possible cause. And it'll be useful data to see how often this triggers once we deploy it.
…rtbeat is missing
OK, updated with just a
src/main/java/com/zendesk/maxwell/replication/BinlogConnectorReplicator.java
LOGGER.warn(
    "Last binlog event seen " + lastEventAge + "ms ago, exceeding " + maxAllowedEventAge + "ms allowance " +
    "(" + binlogHeartbeatInterval + " * " + binlogHeartbeatIntervalAllowance + ")");
}
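The log message above implies that the allowance is the heartbeat interval multiplied by an allowance factor. A minimal sketch of that check, assuming millisecond timestamps and a floating-point factor (the class name, method names, and parameter types are hypothetical, not taken from the PR):

```java
// Hypothetical helper; variable names mirror the ones in the log message,
// but the types and structure are assumptions made for illustration.
final class HeartbeatAllowance {
    private HeartbeatAllowance() {}

    /** Maximum tolerated silence: heartbeat interval scaled by the allowance factor. */
    static long maxAllowedEventAge(long binlogHeartbeatIntervalMs, double binlogHeartbeatIntervalAllowance) {
        return (long) (binlogHeartbeatIntervalMs * binlogHeartbeatIntervalAllowance);
    }

    /** True when the last event is older than the allowance, i.e. the connection looks dead. */
    static boolean isStale(long lastEventSeenAtMs, long nowMs, long intervalMs, double allowance) {
        long lastEventAge = nowMs - lastEventSeenAtMs;
        return lastEventAge > maxAllowedEventAge(intervalMs, allowance);
    }

    public static void main(String[] args) {
        // A 5s heartbeat with a 1.5x allowance tolerates up to 7.5s of silence.
        System.out.println(maxAllowedEventAge(5_000, 1.5)); // 7500
        System.out.println(isStale(0, 7_000, 5_000, 1.5));  // false
        System.out.println(isStale(0, 8_000, 5_000, 1.5));  // true
    }
}
```

Passing the clock in as an argument keeps the check deterministic, which makes a synthetic run possible without waiting for real heartbeats to go missing.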
Have you managed to trigger this code path (and the ensuing reconnection logic) locally? For example, by turning on Maxwell's expectation of heartbeats but turning off the actual heartbeat mechanism?
I don't necessarily expect you to write an integration test here (probably too hard), but I'd like to know that you at least reproduced a synthetic run of the code.
Yeah, I'll hack up something locally once we're happy with the PR and make sure it hangs together 👍
Generally looking good.
Actually, if we're reconnecting instead of dying, there are two edge cases: the initial connection, and then on a subsequent connection we'll still be remembering the last heartbeat (which is by definition too old 💥). We really want to reset the lastEventSeenAt in (and that way "are heartbeats enabled" equates to "is the heartbeat monitor non-null", which feels cleaner)
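One way to read this suggestion is sketched below, assuming a fresh monitor is created on every connect so that neither the initial connection nor a reconnect inherits a stale timestamp. All class and method names here are hypothetical; where the PR actually performs the reset is not shown in this thread.

```java
// Hypothetical sketch; not the code from the PR.
final class HeartbeatMonitor {
    private final long maxAllowedEventAgeMs;
    private long lastEventSeenAtMs;

    HeartbeatMonitor(long maxAllowedEventAgeMs, long nowMs) {
        this.maxAllowedEventAgeMs = maxAllowedEventAgeMs;
        this.lastEventSeenAtMs = nowMs; // a new monitor starts with a fresh timestamp
    }

    void eventSeen(long nowMs) {
        lastEventSeenAtMs = nowMs;
    }

    boolean isStale(long nowMs) {
        return nowMs - lastEventSeenAtMs > maxAllowedEventAgeMs;
    }
}

final class ReplicatorSketch {
    private final Long maxAllowedEventAgeMs; // null means heartbeats are disabled
    private HeartbeatMonitor monitor;        // non-null only while heartbeats are enabled

    ReplicatorSketch(Long maxAllowedEventAgeMs) {
        this.maxAllowedEventAgeMs = maxAllowedEventAgeMs;
    }

    /** Called on the initial connection and on every reconnect. */
    void onConnect(long nowMs) {
        monitor = (maxAllowedEventAgeMs == null)
                ? null
                : new HeartbeatMonitor(maxAllowedEventAgeMs, nowMs);
    }

    void onEvent(long nowMs) {
        if (monitor != null) {
            monitor.eventSeen(nowMs);
        }
    }

    /** "Are heartbeats enabled" reduces to "is the monitor non-null". */
    boolean shouldReconnect(long nowMs) {
        return monitor != null && monitor.isStale(nowMs);
    }
}
```

Because the timestamp lives inside the monitor, replacing the monitor on connect is the reset, and no separate "enabled" flag is needed.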
Great, looks much cleaner. I'm 👍 on this after you tell me you've run a test.
Tested, it (now 😉 ) works as advertised: With the setHeartbeatInterval commented out, we get:
(I also adjusted the threshold locally because MySQL was too chatty.) I had to add a disconnect() before reconnecting, since the replicator might otherwise still think it's connected and raise the already-connected exception. The implementation looks idempotent, and it didn't complain when I ran I also moved the config off
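The disconnect-before-reconnect point can be illustrated with a toy connection object. This is only a sketch under assumptions: the class below is hypothetical and does not model the real binlog client; it just shows why an idempotent disconnect() makes the reconnect path safe when the client still believes it is connected.

```java
// Hypothetical stand-in for a client that refuses to connect twice.
final class ConnectionSketch {
    private boolean connected;

    void connect() {
        if (connected) {
            throw new IllegalStateException("already connected");
        }
        connected = true;
    }

    /** Idempotent: calling this on an already-closed connection is a no-op. */
    void disconnect() {
        connected = false;
    }

    /** Always disconnect first, so a silently dead connection cannot block connect(). */
    void reconnect() {
        disconnect();
        connect();
    }

    boolean isConnected() {
        return connected;
    }
}
```

Calling reconnect() works whether or not the connection was previously marked as open, which is the property the heartbeat-triggered path relies on.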
I know it's been a while, but I finally remembered to check back for some historical data, and we've seen this log message four times over the past month. We also haven't seen any recurrence of the initial symptoms (stuck instance) since we rolled this out in February, so it definitely seems to help.
Since deploying v1.27.x, we've had a small but noticeable uptick in stuck maxwells - no events flowing, but maxwell itself is running with no errors, still reporting metrics etc.
My main suspect is #1548. That disables binlog replicator's dead connection logic because it clashes with maxwell's own functionality. But maxwell's functionality is pretty weak here - it only knows about active failures (i.e. exceptions), if the connection simply goes quiet it won't notice.
After reading shyiko/mysql-binlog-connector-java#118 I think maxwell should use binlog-connector's heartbeating. We still don't want to enable BC's keepalive thread (because it conflicts with our reconnect logic), but we can enable heartbeating and then reimplement the heartbeat detection as part of our existing dead connection detection.
/cc @zendesk/goanna