Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How is keepalive supposed to work? #118

Closed
snaury opened this issue Oct 3, 2016 · 6 comments
Closed

How is keepalive supposed to work? #118

snaury opened this issue Oct 3, 2016 · 6 comments

Comments

@snaury
Copy link

snaury commented Oct 3, 2016

I'm wondering how keepalive is supposed to work, because it seems it doesn't:

  • MySQL server stops reading commands after COM_BINLOG_DUMP/_GTID packet, until mysql_binlog_send returns
  • Sending data on a socket is a very bad way to test whether connection is alive, e.g. having data in the send buffer actually disables TCP keepalive until send buffer is drained, i.e. it actually makes it impossible to know when the connection has failed in case there is system-wide TCP keepalive configuration
  • It would be very unusual to have a send side fail without a corresponding receive side failure as well, thus network problems would likely be manifested on the channel receive (in , not on a channel send
  • Having a keepalive thread spawning other random threads makes it very hard to track what's going on in the background, e.g. what is the status of the current reconnection attempt

Would it be possible to change the API to a more user-friendly workflow? Something like this:

  • Method connect would connect, authenticate and start binlog dumping, but return immediately otherwise
  • Method readEvent would (synchronously) read the next event, instead of a separate thread making callbacks, this way it would be easily to control and exceptions would naturally bubble-up to the application loop (this would also make it similar to BinaryLogFileReader)
  • In case of disconnect the keepalive setting would do a reconnect, in the same thread that called readEvent, again making it easier for exceptions to bubble-up
@shyiko
Copy link
Owner

shyiko commented Oct 4, 2016

Hi @snaury.

You're right. MySQL never actually reacts to COM_PING when in COM_BINLOG_DUMP/_GTID streaming mode. COM_PING is going to be accumulated in "receive buffer" on MySQL side which given default 1 minute keep-alive interval, 1 byte COM_PING and 512-1024kb rcvbuf* yields ~1-2 years to "fill up" (provided my 1am math is correct). Of course, all of this is net.ipv4.* & BinaryLogClient config dependent + (just like you said) effectively disables TCP keep-alive (though this one (arguably) isn't all that useful anyway considering the default value of 2 hours and no easy way (to my knowledge) of controlling it from java (without resorting to JNI) (among other things)).

A correct way (at least the way I see it) is to use HEARTBEAT event (which pretty much boils down to set @master_heartbeat_period=${rate * 1000000000} before COM_BINLOG_DUMP/_GTID). In this case, all keep-alive thread would need to do is to track the time of the last HEARTBEAT event and reestablish the connection if it hasn't been seen within "user-specified timeframe". This is something I was planning to include in 1.0.0 (the reason 1.0.0 wasn't released months ago is simply because I wanted to gather as much feedback (like this one) as possible to make sure everything is accounted for before the API freeze).

Regarding connect & better API - it's on the roadmap (#102 & #114).

A few notes that might or might not be useful:

"what is the status of the current reconnection attempt"

you can register LifecycleListener to get some visibility into what is going on at any point of time

"connect would connect, authenticate and start binlog dumping, but return immediately
otherwise"

This is what connect() (without timeout parameter) does right now provided keepAlive is off. When it's true, however, the behavior is Make BinaryLogClient#connect() more intuitive #102 :)

Let me know if something doesn't make sense or you have some other ideas. Thank you!

@snaury
Copy link
Author

snaury commented Oct 4, 2016

This is what connect() (without timeout parameter) does right now provided keepAlive is off.

Yes, I figured that and would probably do that. However due to event listener the logic becomes somewhat inverted. Consider the following example code, how I see it:

Event event;
boolean haveBufferSpace = true;
while (haveBufferSpace && (event = client.readEvent()) != null) {
    // complex event handling logic, batching, etc.
    // there may be some fatal exceptions due to which the program must audibly fail
}

With the current API I have to put the body of the while loop into the event listener, have complex buffer flushing logic in there as well, I have to effectively catch (Throwable e) because otherwise the client would simply log exceptions and continue as if nothing happened, then I have to remember this exception somewhere, call disconnect() and hope that connect() call would return promptly, and rethrow the exception that I remembered (provided connect() didn't throw some other exception for whatever reason). That's a lot of complexity for a synchronous case, when the loop I have above would have been trivial to convert into an EventListener style handling, not the other way around sadly.

@shyiko
Copy link
Owner

shyiko commented Oct 4, 2016

Like I said, the API is going to simplified in 1.0.0. Meanwhile, maybe you could (just an example) use a blocking queue to which EventListener would enqueue the events? In that case consumer (your while loop & all the logic) would be decoupled from the source (BinaryLogClient) + it wouldn't matter whether BinaryLogClient is running on separate thread or not. An actual code could look very much alike to what you specified above.

@shyiko
Copy link
Owner

shyiko commented Feb 28, 2017

Hey @snaury. I've just uploaded 0.10.0 to Maven Central. It contains proper keepAlive mechanism (using HEARTBEAT event described above). Anyway, I understand that you guys don't use built-in keepAlive but you still might find BinaryLogClient::heartbeatInterval useful. Take a look.

@shyiko shyiko closed this as completed Feb 28, 2017
@snaury
Copy link
Author

snaury commented Feb 28, 2017

Thank you! That's useful even when not using a keepalive thread, but you should be careful, because documentation states:

MASTER_HEARTBEAT_PERIOD sets the interval in seconds between replication heartbeats. Whenever the master's binary log is updated with an event, the waiting period for the next heartbeat is reset. interval is a decimal value having the range 0 to 4294967 seconds and a resolution in milliseconds; the smallest nonzero value is 0.001. Heartbeats are sent by the master only if there are no unsent events in the binary log file for a period longer than interval.

So you should update heartbeatLastSeen on any event, not just on heartbeats, because on busy databases there might be heartbeats at all, as far as I can understand.

@shyiko
Copy link
Owner

shyiko commented Feb 28, 2017

@snaury you're absolute right. Thank you! 0.10.1 deployed to Maven Central.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants