
proxy: Maintain communication state with a heartbeat #91

Merged
merged 1 commit into kata-containers:master from introduce_heartbeat_yamux on Jul 27, 2018

Conversation

@sboeuf commented Jul 16, 2018

We are trying to disable the keepalive feature introduced by Yamux,
both on the client (kata-proxy) and server (kata-agent) sides. The
reason is that we don't want to get Yamux errors when we pause the
VM. The proxy side has already been disabled and we are about to
disable it on the agent side too. The problem is, we sometimes run
into a weird issue where the communication between the proxy and the
agent hangs.

It's related to the emulated serial port created by Qemu, which in
some cases does not wake up from its sleeping loop. This issue is
still under investigation, but a simple fix is to write more data to
the serial port to wake it up. This workaround is needed because
disabling the Yamux keepalive solves several issues, particularly one
related to our long running soak tests.

That's why this commit introduces a simple "keepalive" feature, except
it does not check for any error. The idea is simply to send something
out through this serial port.

Fixes #70

Signed-off-by: Sebastien Boeuf [email protected]
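
For reference, here is a minimal sketch of the heartbeat described above, assuming the yamux client session created in the proxy's serve() function; it mirrors the hunks quoted in the review comments below:

import (
        "time"

        "github.com/hashicorp/yamux"
)

// heartBeat pings the agent once per second and deliberately ignores the
// returned RTT and error: the goal is only to keep writing through the
// serial port, not to tear the session down on failure.
func heartBeat(session *yamux.Session) {
        for {
                session.Ping()

                // 1 Hz heartbeat
                time.Sleep(time.Second)
        }
}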

@sboeuf (Author) commented Jul 16, 2018

Relates to kata-containers/agent#263

opendev-zuul bot commented Jul 17, 2018

Build failed (third-party-check pipeline) integration testing with
OpenStack. For information on how to proceed, see
http://docs.openstack.org/infra/manual/developers.html#automated-testing

@bergwolf (Member):

The reason being we don't want to get Yamux errors in case we pause the VM.

If that is the root cause, would this commit help? hashicorp/yamux@4c2fe0d

We already have it on the runtime side, but I don't think we have it in the proxy or the agent.

@sboeuf (Author) commented Jul 17, 2018

@bergwolf I doubt the patch will change anything for the case explained by @devimc here: kata-containers/agent#231 (comment)
The thing is, I am not sure pausing the container will end up in an ErrSessionShutdown, hence the patch you're pointing to will have no effect.

@devimc commented Jul 17, 2018

CI is not happy

proxy.go:1::warning: file is not gofmted with -s (gofmt)

@sboeuf force-pushed the introduce_heartbeat_yamux branch from df28eaf to bdb0526 on July 17, 2018 14:40
opendev-zuul bot commented Jul 17, 2018

Build failed (third-party-check pipeline) integration testing with
OpenStack. For information on how to proceed, see
http://docs.openstack.org/infra/manual/developers.html#automated-testing

@sboeuf (Author) commented Jul 17, 2018

recheck

@sboeuf force-pushed the introduce_heartbeat_yamux branch from bdb0526 to 063d58f on July 17, 2018 22:41
codecov bot commented Jul 17, 2018

Codecov Report

Merging #91 into master will increase coverage by 1.06%.
The diff coverage is 60%.


@@            Coverage Diff            @@
##           master     #91      +/-   ##
=========================================
+ Coverage   33.33%   34.4%   +1.06%     
=========================================
  Files           2       2              
  Lines         240     250      +10     
=========================================
+ Hits           80      86       +6     
- Misses        149     151       +2     
- Partials       11      13       +2
Impacted Files Coverage Δ
proxy.go 29.62% <60%> (+1.47%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7e2a93d...063d58f.

opendev-zuul bot commented Jul 18, 2018

Build failed (third-party-check pipeline) integration testing with
OpenStack. For information on how to proceed, see
http://docs.openstack.org/infra/manual/developers.html#automated-testing

opendev-zuul bot commented Jul 18, 2018

Build failed (third-party-check pipeline) integration testing with
OpenStack. For information on how to proceed, see
http://docs.openstack.org/infra/manual/developers.html#automated-testing

@grahamwhaley (Contributor) left a comment

I think as a working workaround for the issues seen over on the keepAlive PR at kata-containers/agent#263, this is fine:
lgtm

Just a couple of minor questions.

session.Ping()

// 1 Hz heartbeat
time.Sleep(time.Second)
Contributor:

Just noting, I suspect 1s might be a good balance between too much traffic and keeping the link alive (waking out of freeze). For reference, I believe the default keepAlive timing we are effectively replacing was 30s.

Author:

Yes exactly, I replaced the default 30s with 1s as I thought this was appropriate.

Comment:

My biggest concern here is: how much overhead does this add?

Author:

What do you mean by overhead? The amount of data we're sending to the agent?

Comment:

Memory consumption, CPU usage, bottlenecks.

Author:

I get the CPU usage, but could you elaborate a bit more on memory consumption and bottlenecks?

Comment:

I ran some tests and lgtm

func serve(servConn io.ReadWriteCloser, proto, addr string, results chan error) (net.Listener, error) {
sessionConfig := yamux.DefaultConfig()
// Disable keepAlive since we don't know how much time a container can be paused
sessionConfig.EnableKeepAlive = false
sessionConfig.ConnectionWriteTimeout = time.Second
Contributor:

Any particular reason you set this? The default is 10s in the Yamux config, I think.

Author:

By default, the ping command will wait for 10s to realize a write failed, and I thought it would be better to reduce this to 1s since we no longer kill the communication when such an error happens. If we kept the default 10s, we could end up with cases where no heartbeat is sent for 11s.
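
To make the timing argument concrete, here is a commented version of the client configuration quoted above (the arithmetic in the comments is an interpretation of this explanation, not text from the PR):

sessionConfig := yamux.DefaultConfig()
// Keepalive stays disabled; the explicit heartBeat() goroutine replaces it.
sessionConfig.EnableKeepAlive = false
// With the default 10s write timeout, a Ping() stuck on the serial port can
// block the heartbeat loop for up to 10s, plus the 1s sleep, before the next
// write is attempted (an ~11s gap). Lowering it to 1s keeps the worst-case
// gap between writes at roughly 2s.
sessionConfig.ConnectionWriteTimeout = time.Second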

@grahamwhaley (Contributor):

A note for @chavafg @cboylan - I see we have a Zuul fail on the CI list here - but if I click the 'details' link it takes me off to what looks like the Zuul frontpage (https://zuul.openstack.org/). Will we eventually have the hot links take us off to a nice gui summary for the failed build? Just wondering :-)

@devimc commented Jul 19, 2018

@sboeuf let me try something different: I'll enable keepalive on both sides (agent and proxy), but using a long timeout.

@sboeuf (Author) commented Jul 19, 2018

@devimc well, using a long timeout is not appropriate because we'll hit the same issue and it will take longer to unblock the hanging communication. This does not sound like a viable option IMO.

@grahamwhaley (Contributor):

I happened to do that yesterday whilst testing for the hangout. As well as bumping the timeout on the keepalive, maybe we want to bump the ConnectionWriteTimeout as well/instead - as IIRC it was actually the connection write timeout on the keepalive ping that was failing, not the ping itself per se?

@sboeuf (Author) commented Jul 19, 2018

@grahamwhaley bumping the write timeout will only postpone the moment where the failure will happen (in case the VM is paused).

@grahamwhaley (Contributor):

Ah, yes - true! I had my 'very slow runtime' failure case in mind.

@devimc commented Jul 19, 2018

@sboeuf enabling keepalive on both sides works for me:

kata-proxy

diff --git a/proxy.go b/proxy.go
index 802d80e..705c18d 100644
--- a/proxy.go
+++ b/proxy.go
@@ -50,7 +50,9 @@ var proxyLog = logrus.New()
 func serve(servConn io.ReadWriteCloser, proto, addr string, results chan error) (net.Listener, error) {
        sessionConfig := yamux.DefaultConfig()
        // Disable keepAlive since we don't know how much time a container can be paused
-       sessionConfig.EnableKeepAlive = false
+       sessionConfig.KeepAliveInterval = 10 * time.Second
+       sessionConfig.ConnectionWriteTimeout = time.Hour * 24 * 365
+
        session, err := yamux.Client(servConn, sessionConfig)
        if err != nil {
                return nil, err

kata-agent

diff --git a/channel.go b/channel.go
index 624c81a..bd6b115 100644
--- a/channel.go
+++ b/channel.go
@@ -13,6 +13,7 @@ import (
        "os"
        "path/filepath"
        "strings"
+       "time"
 
        "github.com/hashicorp/yamux"
        "github.com/mdlayher/vsock"
@@ -164,6 +165,8 @@ func (c *serialChannel) listen() (net.Listener, error) {
        config := yamux.DefaultConfig()
 
        config.LogOutput = yamuxWriter{}
+       config.KeepAliveInterval = 10 * time.Second
+       config.ConnectionWriteTimeout = time.Hour * 24 * 365
 
        // Initialize Yamux server.
        session, err := yamux.Server(c.serialConn, config)

// This function is meant to run in a go routine since it will send ping
// commands every second. It behaves as a heartbeat to maintain a proper
// communication state with the Yamux server in the agent.
func heartBeat(session *yamux.Session) {
Comment:

Yamux's config.KeepAliveInterval already does this; we just need to increase config.ConnectionWriteTimeout.

@sboeuf (Author) commented Jul 19, 2018

@devimc of course it works: you have a heartbeat beating every 10s with a write timeout of one year. So basically, you're not running into the hang issue because something is written every 10s, which keeps the connection alive, and at the same time, if the heartbeat hangs because the VM has been paused, it will not generate an error before a full year has elapsed.
But I feel that using one year here is sort of weird, and I don't think we should use the keepalive at all.
@bergwolf @grahamwhaley WDYT? It'd be great if we could make a decision on this so that we can unblock the metrics.

@devimc commented Jul 19, 2018

lgtm

@cboylan commented Jul 19, 2018

A note for @chavafg @cboylan - I see we have a Zuul fail on the CI list here - but if I click the 'details'
link it takes me off to what looks like the Zuul frontpage (https://zuul.openstack.org/). Will we eventually
have the hot links take us off to a nice gui summary for the failed build? Just wondering :-)

@grahamwhaley this is a current known deficiency and part of the reason we are leaving distinct comments with logs. The current details page will get you to the live status page which is great when jobs are running but less so once they complete and report.

The plan to correct this is to support GitHub's new status check api, https://help.github.com/articles/about-status-checks/, and report details in that way. This work hasn't started yet. If anyone is interested in helping I'd be happy to help get them started. Otherwise I'd expect it to happen in the medium-term future.

@sboeuf (Author) commented Jul 20, 2018

@bergwolf could you give your opinion on this, please?

@bergwolf (Member):

@sboeuf I'm still hesitant about this one. Please see my comments on the agent PR.

session, err := yamux.Client(servConn, sessionConfig)
if err != nil {
return nil, err
}

// Start the heartbeat in a separate go routine
go heartBeat(session)
Contributor:

Since this is a temporary fix, could you add a little more explanation in the code (and ideally a link to an issue so we don't forget about this)?
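
For illustration, the requested annotation could look something like this at the call site (the wording is only a suggestion; the reference to issue #70, which this PR fixes, comes from this thread):

// heartBeat is a workaround for the hang seen with Yamux keepalive disabled:
// the Qemu emulated serial port can stay stuck in its sleeping loop, so we
// keep writing through it every second to wake it up.
// See https://github.com/kata-containers/proxy/issues/70.
go heartBeat(session)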

@bergwolf (Member) commented Jul 27, 2018

LGTM. Thanks for the explanations @sboeuf !

Approved with PullApprove

@sboeuf merged commit c416c9f into kata-containers:master on Jul 27, 2018
@egernst mentioned this pull request on Aug 23, 2018
lifupan pushed a commit to lifupan/agent that referenced this pull request Sep 14, 2018
The yamux client runs on the proxy side; sometimes the client is
handling other requests and is not able to respond to the ping sent
by the server, and the communication is closed. To avoid I/O timeouts
in the communication between agent and proxy, keepalive should be
disabled.

Depends-on: github.com/kata-containers/proxy#91

fixes kata-containers/proxy#70
fixes kata-containers#231

Signed-off-by: Julio Montes <[email protected]>
lifupan added a commit to lifupan/agent that referenced this pull request Oct 15, 2018
Since Yamux's keepalive has been disabled both
on the server and the client side, we hit
a weird issue where the communication between the
proxy and the agent hangs. The same issue has been
fixed in kata proxy by:
"kata-containers/proxy#91".

This commit just cherry-picks that patch here to
fix the same issue for the kata builtin proxy.

Fixes: kata-containers#396

Signed-off-by: fupan <[email protected]>
jshachm pushed the same two commits to jshachm/agent, referencing this pull request, on Nov 22, 2018.
Successfully merging this pull request may close these issues:

yamux: keepalive failed: i/o deadline reached