proxy: Maintain communication state with a heartbeat #91
Conversation
Relates to kata-containers/agent#263
Build failed (third-party-check pipeline) integration testing with
If that is the root cause, would this commit help? hashicorp/yamux@4c2fe0d We already have it on the runtime side but I don't think we have it in the proxy or the agent.
@bergwolf I doubt the patch will change anything for the case explained by @devimc here: kata-containers/agent#231 (comment)
CI is not happy
Force-pushed from df28eaf to bdb0526
Build failed (third-party-check pipeline) integration testing with
recheck
We are trying to disable the keepalive feature introduced by Yamux both on the client (kata-proxy) and server (kata-agent) sides, the reason being that we don't want to get Yamux errors in case we pause the VM. The proxy side has already been disabled and we are about to disable it on the agent side too. Problem is, we sometimes run into a weird issue where the communication between the proxy and the agent hangs.

It's related to the emulated serial port created by Qemu, which in some cases does not get out of its sleeping loop. This issue is still under investigation, but a simple fix is to actually write more data to the serial port to wake it up. This workaround is needed since disabling Yamux keepalive solves several issues, particularly one related to our long-running soak tests.

That's why this commit enables a simple "keepalive" feature, except it does not check for any error. The idea is simply to send something out through this serial port.

Fixes #70

Signed-off-by: Sebastien Boeuf <[email protected]>
Force-pushed from bdb0526 to 063d58f
Codecov Report
@@            Coverage Diff            @@
##           master      #91    +/-   ##
=========================================
+ Coverage   33.33%    34.4%   +1.06%
=========================================
  Files           2        2
  Lines         240      250      +10
=========================================
+ Hits           80       86       +6
- Misses        149      151       +2
- Partials       11       13       +2

Continue to review full report at Codecov.
Build failed (third-party-check pipeline) integration testing with
Build failed (third-party-check pipeline) integration testing with
I think as a working workaround for the issues seen over on the keepAlive PR at kata-containers/agent#263, this is fine:
lgtm
Just a couple of minor questions.
session.Ping()

// 1 Hz heartbeat
time.Sleep(time.Second)
Just noting, I suspect 1s might be a good balance between too much traffic and keeping the link alive (waking out of freeze). For reference, I believe the default keepAlive timing we are effectively replacing was 30s.
Yes exactly, I replaced the default 30s with 1s as I thought this was appropriate.
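For reference, the yamux defaults under discussion can be checked directly. A quick standalone snippet (not part of this PR):

package main

import (
	"fmt"

	"github.com/hashicorp/yamux"
)

func main() {
	cfg := yamux.DefaultConfig()
	fmt.Println(cfg.EnableKeepAlive)        // true - the feature being disabled here
	fmt.Println(cfg.KeepAliveInterval)      // 30s  - the cadence the 1s heartbeat replaces
	fmt.Println(cfg.ConnectionWriteTimeout) // 10s  - discussed further down
}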
my biggest concern here is how much overhead does this add?
What do you mean by overhead? The amount of data we're sending to the agent?
memory consumption, CPU usage, bottlenecks
I get the CPU usage, but could you elaborate a bit more on memory consumption and bottlenecks?
I ran some tests and lgtm
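On the CPU-usage question, one rough way to gauge the per-ping cost is to run yamux over an in-memory pipe and time Ping() calls. A self-contained sketch (illustrative only; numbers over the real Qemu serial port will differ):

package main

import (
	"fmt"
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

func main() {
	clientConn, serverConn := net.Pipe()

	// The server session answers pings from its receive loop.
	server, err := yamux.Server(serverConn, yamux.DefaultConfig())
	if err != nil {
		panic(err)
	}
	defer server.Close()

	client, err := yamux.Client(clientConn, yamux.DefaultConfig())
	if err != nil {
		panic(err)
	}
	defer client.Close()

	const pings = 1000
	start := time.Now()
	for i := 0; i < pings; i++ {
		if _, err := client.Ping(); err != nil {
			panic(err)
		}
	}
	fmt.Printf("average cost per Ping(): %v\n", time.Since(start)/pings)
}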
func serve(servConn io.ReadWriteCloser, proto, addr string, results chan error) (net.Listener, error) {
	sessionConfig := yamux.DefaultConfig()
	// Disable keepAlive since we don't know how much time a container can be paused
	sessionConfig.EnableKeepAlive = false
	sessionConfig.ConnectionWriteTimeout = time.Second
Any particular reason you set this - the default is 10s in the Yamux configs I think.
By default, the ping command will wait 10s to realize a write failed, and I thought it would be better to reduce this to 1s since we're no longer killing the communication when such an error happens. If we kept the 10s default, we might end up with cases where no heartbeat is sent for 11s.
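A back-of-the-envelope check of that worst-case gap (one sleep interval plus one blocked write), matching the reasoning above; a sketch, not part of the PR:

package main

import (
	"fmt"
	"time"
)

func main() {
	sleep := time.Second // the 1 Hz interval used by heartBeat()

	defaultTimeout := 10 * time.Second // yamux default ConnectionWriteTimeout
	prTimeout := time.Second           // value set by this PR

	// In the worst case, a blocked heartbeat only gives up after
	// sleeping and then waiting out the write timeout.
	fmt.Println("max silence with the default timeout:", sleep+defaultTimeout) // 11s
	fmt.Println("max silence with this PR:            ", sleep+prTimeout)      // 2s
}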
A note for @chavafg @cboylan - I see we have a Zuul fail on the CI list here, but if I click the 'details' link it takes me off to what looks like the Zuul front page (https://zuul.openstack.org/). Will the hot links eventually take us off to a nice GUI summary for the failed build? Just wondering :-)
@sboeuf let me try something different, I'll enable keepalive on both sides (agent and proxy) but using a long timeout
@devimc well, using a long timeout is not appropriate because we'll hit the same issue and it'll take longer to unblock the hanging communication. This does not sound like a viable option IMO.
I happened to do that yesterday whilst testing for the hangout. As well as bumping the timeout on the keepalive, maybe we want to bump the ConnectionWriteTimeout as well/instead - as IIRC it was actually the connection write timeout on the keepalive ping that was failing, not the ping itself per se?
@grahamwhaley bumping the write timeout will only postpone the moment when the failure happens (in case the VM is paused).
Ah, yes - true! I had my 'very slow runtime' failure case in mind.
@sboeuf enabling keepAlive on both sides works for me.

kata-proxy:

diff --git a/proxy.go b/proxy.go
index 802d80e..705c18d 100644
--- a/proxy.go
+++ b/proxy.go
@@ -50,7 +50,9 @@ var proxyLog = logrus.New()
 func serve(servConn io.ReadWriteCloser, proto, addr string, results chan error) (net.Listener, error) {
 	sessionConfig := yamux.DefaultConfig()
 	// Disable keepAlive since we don't know how much time a container can be paused
-	sessionConfig.EnableKeepAlive = false
+	sessionConfig.KeepAliveInterval = 10 * time.Second
+	sessionConfig.ConnectionWriteTimeout = time.Hour * 24 * 365
+
 	session, err := yamux.Client(servConn, sessionConfig)
 	if err != nil {
 		return nil, err

kata-agent:

diff --git a/channel.go b/channel.go
index 624c81a..bd6b115 100644
--- a/channel.go
+++ b/channel.go
@@ -13,6 +13,7 @@ import (
 	"os"
 	"path/filepath"
 	"strings"
+	"time"

 	"github.com/hashicorp/yamux"
 	"github.com/mdlayher/vsock"
@@ -164,6 +165,8 @@ func (c *serialChannel) listen() (net.Listener, error) {
 	config := yamux.DefaultConfig()
 	config.LogOutput = yamuxWriter{}
+	config.KeepAliveInterval = 10 * time.Second
+	config.ConnectionWriteTimeout = time.Hour * 24 * 365

 	// Initialize Yamux server.
 	session, err := yamux.Server(c.serialConn, config)
// This function is meant to run in a go routine since it will send ping
// commands every second. It behaves as a heartbeat to maintain a proper
// communication state with the Yamux server in the agent.
func heartBeat(session *yamux.Session) {
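Assembled from the hunks in this PR, the heartbeat goroutine presumably amounts to the following sketch (the enclosing loop is inferred from the comment above, not copied from the diff):

// heartBeat pings the agent once per second, deliberately ignoring any
// error: the VM may legitimately be paused, and the only goal is to keep
// writing to the serial port so it wakes out of its sleeping loop.
func heartBeat(session *yamux.Session) {
	for {
		session.Ping()

		// 1 Hz heartbeat
		time.Sleep(time.Second)
	}
}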
yamux config.KeepAliveInterval already does this; we just need to increase config.ConnectionWriteTimeout
@devimc of course it works: you have a heartbeat beating every 10s with a write timeout of one year. So basically, you're not running into the hang issue because something is written every 10s, which keeps the connection alive, and at the same time, if this heartbeat hangs because the VM has been paused, it will not generate an error before it reaches one year.
lgtm
@grahamwhaley this is a current known deficiency and part of the reason we are leaving distinct comments with logs. The current details page will get you to the live status page, which is great when jobs are running but less so once they complete and report. The plan to correct this is to support GitHub's new status check API, https://help.github.com/articles/about-status-checks/, and report details in that way. This work hasn't started yet. If anyone is interested in helping I'd be happy to help get them started. Otherwise I'd expect it to happen in the medium-term future.
@bergwolf could you give your opinion on this please?
@sboeuf I'm still hesitating on this one. Please see my comments on the agent PR
session, err := yamux.Client(servConn, sessionConfig)
if err != nil {
	return nil, err
}

// Start the heartbeat in a separate go routine
go heartBeat(session)
Since this is a temporary fix, could you add a little more explanation in the code (and ideally a link to an issue so we don't forget about this)?
LGTM. Thanks for the explanations @sboeuf!
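To see the pieces of this PR working together end to end, here is a self-contained sketch; an in-memory pipe stands in for the Qemu serial port, and while the names mirror the PR, this is not the actual proxy code:

package main

import (
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

func heartBeat(session *yamux.Session) {
	for {
		session.Ping() // errors deliberately ignored

		// 1 Hz heartbeat
		time.Sleep(time.Second)
	}
}

func main() {
	clientConn, serverConn := net.Pipe()

	// Stand-in for the agent's yamux server.
	server, err := yamux.Server(serverConn, yamux.DefaultConfig())
	if err != nil {
		panic(err)
	}
	defer server.Close()

	sessionConfig := yamux.DefaultConfig()
	// Disable keepAlive since we don't know how much time a container can be paused
	sessionConfig.EnableKeepAlive = false
	sessionConfig.ConnectionWriteTimeout = time.Second

	session, err := yamux.Client(clientConn, sessionConfig)
	if err != nil {
		panic(err)
	}
	defer session.Close()

	// Start the heartbeat in a separate go routine
	go heartBeat(session)

	time.Sleep(5 * time.Second) // let a few heartbeats go out
}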
yamux client runs on the proxy side; sometimes the client is handling other requests and is not able to respond to the ping sent by the server, so the communication is closed. To avoid IO timeouts in the communication between agent and proxy, keepalive should be disabled.

Depends-on: github.com/kata-containers/proxy#91

fixes kata-containers/proxy#70
fixes kata-containers#231

Signed-off-by: Julio Montes <[email protected]>

Since Yamux's keepalive has been disabled on both the server and the client side, a weird issue appears where the communication between the proxy and the agent hangs. The same issue has been fixed in kata proxy by "kata-containers/proxy#91". This commit just cherry-picks that patch to fix the same issue in the kata builtin proxy.

Fixes: kata-containers#396

Signed-off-by: fupan <[email protected]>