How is keepalive supposed to work? #118

snaury · 2016-10-03T15:54:04Z

I'm wondering how keepalive is supposed to work, because it seems it doesn't:

MySQL server stops reading commands after COM_BINLOG_DUMP/_GTID packet, until mysql_binlog_send returns
Sending data on a socket is a very bad way to test whether connection is alive, e.g. having data in the send buffer actually disables TCP keepalive until send buffer is drained, i.e. it actually makes it impossible to know when the connection has failed in case there is system-wide TCP keepalive configuration
It would be very unusual to have a send side fail without a corresponding receive side failure as well, thus network problems would likely be manifested on the channel receive (in , not on a channel send
Having a keepalive thread spawning other random threads makes it very hard to track what's going on in the background, e.g. what is the status of the current reconnection attempt

Would it be possible to change the API to a more user-friendly workflow? Something like this:

Method connect would connect, authenticate and start binlog dumping, but return immediately otherwise
Method readEvent would (synchronously) read the next event, instead of a separate thread making callbacks, this way it would be easily to control and exceptions would naturally bubble-up to the application loop (this would also make it similar to BinaryLogFileReader)
In case of disconnect the keepalive setting would do a reconnect, in the same thread that called readEvent, again making it easier for exceptions to bubble-up

The text was updated successfully, but these errors were encountered:

shyiko · 2016-10-04T08:28:28Z

You're right. MySQL never actually reacts to COM_PING when in COM_BINLOG_DUMP/_GTID streaming mode. COM_PING is going to be accumulated in "receive buffer" on MySQL side which given default 1 minute keep-alive interval, 1 byte COM_PING and 512-1024kb rcvbuf* yields ~1-2 years to "fill up" (provided my 1am math is correct). Of course, all of this is net.ipv4.* & BinaryLogClient config dependent + (just like you said) effectively disables TCP keep-alive (though this one (arguably) isn't all that useful anyway considering the default value of 2 hours and no easy way (to my knowledge) of controlling it from java (without resorting to JNI) (among other things)).

A correct way (at least the way I see it) is to use HEARTBEAT event (which pretty much boils down to set @master_heartbeat_period=${rate * 1000000000} before COM_BINLOG_DUMP/_GTID). In this case, all keep-alive thread would need to do is to track the time of the last HEARTBEAT event and reestablish the connection if it hasn't been seen within "user-specified timeframe". This is something I was planning to include in 1.0.0 (the reason 1.0.0 wasn't released months ago is simply because I wanted to gather as much feedback (like this one) as possible to make sure everything is accounted for before the API freeze).

Regarding connect & better API - it's on the roadmap (#102 & #114).

A few notes that might or might not be useful:

"what is the status of the current reconnection attempt"

you can register LifecycleListener to get some visibility into what is going on at any point of time

"connect would connect, authenticate and start binlog dumping, but return immediately
otherwise"

This is what connect() (without timeout parameter) does right now provided keepAlive is off. When it's true, however, the behavior is Make BinaryLogClient#connect() more intuitive #102 :)

Let me know if something doesn't make sense or you have some other ideas. Thank you!

snaury · 2016-10-04T08:42:17Z

This is what connect() (without timeout parameter) does right now provided keepAlive is off.

Yes, I figured that and would probably do that. However due to event listener the logic becomes somewhat inverted. Consider the following example code, how I see it:

Event event;
boolean haveBufferSpace = true;
while (haveBufferSpace && (event = client.readEvent()) != null) {
    // complex event handling logic, batching, etc.
    // there may be some fatal exceptions due to which the program must audibly fail
}

With the current API I have to put the body of the while loop into the event listener, have complex buffer flushing logic in there as well, I have to effectively catch (Throwable e) because otherwise the client would simply log exceptions and continue as if nothing happened, then I have to remember this exception somewhere, call disconnect() and hope that connect() call would return promptly, and rethrow the exception that I remembered (provided connect() didn't throw some other exception for whatever reason). That's a lot of complexity for a synchronous case, when the loop I have above would have been trivial to convert into an EventListener style handling, not the other way around sadly.

shyiko · 2016-10-04T15:51:35Z

Like I said, the API is going to simplified in 1.0.0. Meanwhile, maybe you could (just an example) use a blocking queue to which EventListener would enqueue the events? In that case consumer (your while loop & all the logic) would be decoupled from the source (BinaryLogClient) + it wouldn't matter whether BinaryLogClient is running on separate thread or not. An actual code could look very much alike to what you specified above.

shyiko · 2017-02-28T17:05:06Z

Hey @snaury. I've just uploaded 0.10.0 to Maven Central. It contains proper keepAlive mechanism (using HEARTBEAT event described above). Anyway, I understand that you guys don't use built-in keepAlive but you still might find BinaryLogClient::heartbeatInterval useful. Take a look.

snaury · 2017-02-28T19:28:37Z

Thank you! That's useful even when not using a keepalive thread, but you should be careful, because documentation states:

MASTER_HEARTBEAT_PERIOD sets the interval in seconds between replication heartbeats. Whenever the master's binary log is updated with an event, the waiting period for the next heartbeat is reset. interval is a decimal value having the range 0 to 4294967 seconds and a resolution in milliseconds; the smallest nonzero value is 0.001. Heartbeats are sent by the master only if there are no unsent events in the binary log file for a period longer than interval.

So you should update heartbeatLastSeen on any event, not just on heartbeats, because on busy databases there might be heartbeats at all, as far as I can understand.

shyiko · 2017-02-28T20:15:38Z

@snaury you're absolute right. Thank you! 0.10.1 deployed to Maven Central.

shyiko mentioned this issue Feb 25, 2017

openChannel can block forever on channel.getInputStream().peek() #154

Closed

shyiko closed this as completed Feb 28, 2017

denisahearn mentioned this issue Jul 7, 2017

How to choose the right intervals for the "Keep Alive" and "Heartbeat" features #178

Open

timbertson-zd mentioned this issue Feb 2, 2021

BinlogConnectorReplicator: add heartbeat detection [MELINF-2251] zendesk/maxwell#1643

Merged

d-hrs mentioned this issue Jul 19, 2023

set heartbeat interval trocco-io/embulk-input-mysql_binlog#73

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How is keepalive supposed to work? #118

How is keepalive supposed to work? #118

snaury commented Oct 3, 2016

shyiko commented Oct 4, 2016

snaury commented Oct 4, 2016

shyiko commented Oct 4, 2016

shyiko commented Feb 28, 2017

snaury commented Feb 28, 2017

shyiko commented Feb 28, 2017

How is keepalive supposed to work? #118

How is keepalive supposed to work? #118

Comments

snaury commented Oct 3, 2016

shyiko commented Oct 4, 2016

snaury commented Oct 4, 2016

shyiko commented Oct 4, 2016

shyiko commented Feb 28, 2017

snaury commented Feb 28, 2017

shyiko commented Feb 28, 2017