proxy: Maintain communication state with a heartbeat #91
Conversation
Relates to kata-containers/agent#263
Build failed (third-party-check pipeline) integration testing with
If that is the root cause, would this commit help? hashicorp/yamux@4c2fe0d We already have it on the runtime side but I don't think we have it in the proxy or the agent.
@bergwolf I doubt the patch will change anything for the case explained by @devimc here: kata-containers/agent#231 (comment)
CI is not happy
Force-pushed from df28eaf to bdb0526
Build failed (third-party-check pipeline) integration testing with
recheck
We are trying to disable the keepalive feature introduced by Yamux both on the client (kata-proxy) and server (kata-agent) sides, the reason being that we don't want to get Yamux errors in case we pause the VM. The proxy side has already been disabled and we are about to disable it on the agent side too. Problem is, we sometimes run into a weird issue where the communication between the proxy and the agent hangs.

It's related to the emulated serial port created by Qemu, which in some cases does not get out of its sleeping loop. This issue is still under investigation, but a simple fix is to actually write more data to the serial port to wake it up. This workaround is needed since disabling Yamux keepalive solves several issues, particularly one related to our long-running soak tests.

That's why this commit enables a simple "keepalive" feature, except it does not check for any error. The idea is simply to send something out through this serial port.

Fixes #70

Signed-off-by: Sebastien Boeuf <[email protected]>
Force-pushed from bdb0526 to 063d58f
Codecov Report
@@            Coverage Diff            @@
##           master      #91    +/-   ##
=========================================
+ Coverage   33.33%    34.4%   +1.06%
=========================================
  Files           2        2
  Lines         240      250      +10
=========================================
+ Hits           80       86       +6
- Misses        149      151       +2
- Partials       11       13       +2

Continue to review full report at Codecov.
Build failed (third-party-check pipeline) integration testing with
Build failed (third-party-check pipeline) integration testing with
I think as a working workaround for the issues seen over on the keepAlive PR at kata-containers/agent#263, this is fine:
lgtm
Just a couple of minor questions.
session.Ping()

// 1 Hz heartbeat
time.Sleep(time.Second)
Just noting, I suspect 1s might be a good balance between too much traffic and keeping the link alive (waking out of freeze). For reference, I believe the default keepAlive timing we are effectively replacing was 30s.
Yes exactly, I replaced the default 30s with 1s as I thought this was appropriate.
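For reference, the yamux defaults under discussion can be checked directly. A quick standalone snippet (not part of this PR):

package main

import (
	"fmt"

	"github.com/hashicorp/yamux"
)

func main() {
	cfg := yamux.DefaultConfig()
	fmt.Println(cfg.EnableKeepAlive)        // true - the feature being disabled here
	fmt.Println(cfg.KeepAliveInterval)      // 30s  - the cadence the 1s heartbeat replaces
	fmt.Println(cfg.ConnectionWriteTimeout) // 10s  - discussed further down
}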
my biggest concern here is how much overhead does this add?
What do you mean by overhead? The amount of data we're sending to the agent?
memory consumption, CPU usage, bottlenecks
I get the CPU usage, but could you elaborate a bit more on memory consumption and bottlenecks?
I ran some tests and lgtm
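On the CPU-usage question, one rough way to gauge the per-ping cost is to run yamux over an in-memory pipe and time Ping() calls. A self-contained sketch (illustrative only; numbers over the real Qemu serial port will differ):

package main

import (
	"fmt"
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

func main() {
	clientConn, serverConn := net.Pipe()

	// The server session answers pings from its receive loop.
	server, err := yamux.Server(serverConn, yamux.DefaultConfig())
	if err != nil {
		panic(err)
	}
	defer server.Close()

	client, err := yamux.Client(clientConn, yamux.DefaultConfig())
	if err != nil {
		panic(err)
	}
	defer client.Close()

	const pings = 1000
	start := time.Now()
	for i := 0; i < pings; i++ {
		if _, err := client.Ping(); err != nil {
			panic(err)
		}
	}
	fmt.Printf("average cost per Ping(): %v\n", time.Since(start)/pings)
}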
func serve(servConn io.ReadWriteCloser, proto, addr string, results chan error) (net.Listener, error) {
	sessionConfig := yamux.DefaultConfig()
	// Disable keepAlive since we don't know how much time a container can be paused
	sessionConfig.EnableKeepAlive = false
	sessionConfig.ConnectionWriteTimeout = time.Second
Any particular reason you set this - the default is 10s in the Yamux configs I think.
By default, the ping command will wait 10s to realize a write failed, and I thought it would be better to reduce this to 1s since we're no longer killing the communication when such an error happens. If we kept the 10s default, we might end up with cases where no heartbeat is sent for 11s.
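A back-of-the-envelope check of that worst-case gap (one sleep interval plus one blocked write), matching the reasoning above; a sketch, not part of the PR:

package main

import (
	"fmt"
	"time"
)

func main() {
	sleep := time.Second // the 1 Hz interval used by heartBeat()

	defaultTimeout := 10 * time.Second // yamux default ConnectionWriteTimeout
	prTimeout := time.Second           // value set by this PR

	// In the worst case, a blocked heartbeat only gives up after
	// sleeping and then waiting out the write timeout.
	fmt.Println("max silence with the default timeout:", sleep+defaultTimeout) // 11s
	fmt.Println("max silence with this PR:            ", sleep+prTimeout)      // 2s
}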
A note for @chavafg @cboylan - I see we have a Zuul fail on the CI list here, but if I click the 'details' link it takes me off to what looks like the Zuul front page (https://zuul.openstack.org/). Will the hot links eventually take us off to a nice GUI summary for the failed build? Just wondering :-)
@sboeuf let me try something different, I'll enable keepalive on both sides (agent and proxy) but using a long timeout
@devimc well, using a long timeout is not appropriate because we'll hit the same issue and it'll take longer to unblock the hanging communication. This does not sound like a viable option IMO.
I happened to do that yesterday whilst testing for the hangout. As well as bumping the timeout on the keepalive, maybe we want to bump the ConnectionWriteTimeout as well/instead - as IIRC it was actually the connection write timeout on the keepalive ping that was failing, not the ping itself per se?
@grahamwhaley bumping the write timeout will only postpone the moment when the failure happens (in case the VM is paused).
Ah, yes - true! I had my 'very slow runtime' failure case in mind.
@sboeuf enabling keepAlive on both sides works for me.

kata-proxy:

diff --git a/proxy.go b/proxy.go
index 802d80e..705c18d 100644
--- a/proxy.go
+++ b/proxy.go
@@ -50,7 +50,9 @@ var proxyLog = logrus.New()
 func serve(servConn io.ReadWriteCloser, proto, addr string, results chan error) (net.Listener, error) {
 	sessionConfig := yamux.DefaultConfig()
 	// Disable keepAlive since we don't know how much time a container can be paused
-	sessionConfig.EnableKeepAlive = false
+	sessionConfig.KeepAliveInterval = 10 * time.Second
+	sessionConfig.ConnectionWriteTimeout = time.Hour * 24 * 365
+
 	session, err := yamux.Client(servConn, sessionConfig)
 	if err != nil {
 		return nil, err

kata-agent:

diff --git a/channel.go b/channel.go
index 624c81a..bd6b115 100644
--- a/channel.go
+++ b/channel.go
@@ -13,6 +13,7 @@ import (
 	"os"
 	"path/filepath"
 	"strings"
+	"time"

 	"github.com/hashicorp/yamux"
 	"github.com/mdlayher/vsock"
@@ -164,6 +165,8 @@ func (c *serialChannel) listen() (net.Listener, error) {
 	config := yamux.DefaultConfig()
 	config.LogOutput = yamuxWriter{}
+	config.KeepAliveInterval = 10 * time.Second
+	config.ConnectionWriteTimeout = time.Hour * 24 * 365

 	// Initialize Yamux server.
 	session, err := yamux.Server(c.serialConn, config)
// This function is meant to run in a go routine since it will send ping
// commands every second. It behaves as a heartbeat to maintain a proper
// communication state with the Yamux server in the agent.
func heartBeat(session *yamux.Session) {
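Assembled from the hunks in this PR, the heartbeat goroutine presumably amounts to the following sketch (the enclosing loop is inferred from the comment above, not copied from the diff):

// heartBeat pings the agent once per second, deliberately ignoring any
// error: the VM may legitimately be paused, and the only goal is to keep
// writing to the serial port so it wakes out of its sleeping loop.
func heartBeat(session *yamux.Session) {
	for {
		session.Ping()

		// 1 Hz heartbeat
		time.Sleep(time.Second)
	}
}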
yamux config.KeepAliveInterval already does this; we just need to increase config.ConnectionWriteTimeout
@devimc of course it works: you have a heartbeat beating every 10s with a write timeout of one year. So basically, you're not running into the hang issue because something is written every 10s, which keeps the connection alive, and at the same time, if this heartbeat hangs because the VM has been paused, it will not generate an error before it reaches one year.
lgtm
@grahamwhaley this is a current known deficiency and part of the reason we are leaving distinct comments with logs. The current details page will get you to the live status page, which is great when jobs are running but less so once they complete and report. The plan to correct this is to support GitHub's new status check API, https://help.github.com/articles/about-status-checks/, and report details in that way. This work hasn't started yet. If anyone is interested in helping I'd be happy to help get them started. Otherwise I'd expect it to happen in the medium-term future.
@bergwolf could you give your opinion on this please?
@sboeuf I'm still hesitating on this one. Please see my comments on the agent PR
session, err := yamux.Client(servConn, sessionConfig)
if err != nil {
	return nil, err
}

// Start the heartbeat in a separate go routine
go heartBeat(session)
Since this is a temporary fix, could you add a little more explanation in the code (and ideally a link to an issue so we don't forget about this)?
LGTM. Thanks for the explanations @sboeuf!
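To see the pieces of this PR working together end to end, here is a self-contained sketch; an in-memory pipe stands in for the Qemu serial port, and while the names mirror the PR, this is not the actual proxy code:

package main

import (
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

func heartBeat(session *yamux.Session) {
	for {
		session.Ping() // errors deliberately ignored

		// 1 Hz heartbeat
		time.Sleep(time.Second)
	}
}

func main() {
	clientConn, serverConn := net.Pipe()

	// Stand-in for the agent's yamux server.
	server, err := yamux.Server(serverConn, yamux.DefaultConfig())
	if err != nil {
		panic(err)
	}
	defer server.Close()

	sessionConfig := yamux.DefaultConfig()
	// Disable keepAlive since we don't know how much time a container can be paused
	sessionConfig.EnableKeepAlive = false
	sessionConfig.ConnectionWriteTimeout = time.Second

	session, err := yamux.Client(clientConn, sessionConfig)
	if err != nil {
		panic(err)
	}
	defer session.Close()

	// Start the heartbeat in a separate go routine
	go heartBeat(session)

	time.Sleep(5 * time.Second) // let a few heartbeats go out
}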
yamux client runs on the proxy side; sometimes the client is handling other requests and is not able to respond to the ping sent by the server, so the communication is closed. To avoid IO timeouts in the communication between agent and proxy, keepalive should be disabled.

Depends-on: github.com/kata-containers/proxy#91

fixes kata-containers/proxy#70
fixes kata-containers#231

Signed-off-by: Julio Montes <[email protected]>

Since Yamux's keepalive has been disabled on both the server and the client side, a weird issue appears where the communication between the proxy and the agent hangs. The same issue has been fixed in kata proxy by "kata-containers/proxy#91". This commit just cherry-picks that patch to fix the same issue in the kata builtin proxy.

Fixes: kata-containers#396

Signed-off-by: fupan <[email protected]>