This repository has been archived by the owner on May 12, 2021. It is now read-only.

yamux: keepalive failed: i/o deadline reached #231

Closed
devimc opened this issue May 2, 2018 · 19 comments · Fixed by kata-containers/proxy#71 or #263
Assignees
Labels
bug Incorrect behaviour highest-priority Critically urgent issue (must be resolved as soon as possible)

Comments

@devimc

devimc commented May 2, 2018

Description of problem

Run a container with -ti and pause it for 45 seconds

Terminal 1

$ docker run --name foo -ti centos bash

Terminal 2

$ docker pause foo
$ sleep 45
$ docker unpause foo

Terminal 1 gets stuck after 45 seconds, and the following message is logged in the journal

[ERR] yamux: keepalive failed: i/o deadline reached
@devimc devimc added the bug Incorrect behaviour label May 2, 2018
@grahamwhaley
Contributor

Ah, I think I have seen this recently, but haven't had time to track down why... currently, if you 'ignore' a container, it will stop responding and then die... good find! +1, I think this is a real current issue!
Hmm, we could add a test for this to the CI, but that would push out the CI feedback loop time by another 10 minutes... so maybe we won't. But if we ever get a 'long soak test' QA CI that runs on master branch merges (rather than on PRs), we should add it there.

@grahamwhaley grahamwhaley added the highest-priority Critically urgent issue (must be resolved as soon as possible) label May 2, 2018
@grahamwhaley
Contributor

I've added a P1 label as well...

@sboeuf

sboeuf commented May 2, 2018

Dumb question: can we simply disable this timeout?

@devimc
Author

devimc commented May 2, 2018

@sboeuf ehmm, which timeout? Is there a timeout in yamux?

@sboeuf

sboeuf commented May 2, 2018

Take a look at this test in the yamux repo: https://github.com/hashicorp/yamux/blob/master/session_test.go#L797-L838

@grahamwhaley
Contributor

Yep, maybe we need to set EnableKeepAlive? https://github.com/hashicorp/yamux/blob/d1caa6c97c9fc1cc9e83bbe34d0603f9ff0ce8bd/mux.go#L16

@sboeuf

sboeuf commented May 2, 2018

Yes, we need to make sure we understand the potential cases, but smart usage of those three flags should do the trick: https://github.com/hashicorp/yamux/blob/d1caa6c97c9fc1cc9e83bbe34d0603f9ff0ce8bd/mux.go#L16-L27 ;)
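
For reference, a minimal sketch of what tuning those flags could look like; the helper name and the values are illustrative assumptions, not a tested fix:

import (
	"github.com/hashicorp/yamux"
)

// newSessionConfig builds a yamux config that won't tear the session
// down while the peer is unresponsive (hypothetical helper).
func newSessionConfig() *yamux.Config {
	cfg := yamux.DefaultConfig()
	cfg.EnableKeepAlive = false // option 1: stop pinging the peer entirely
	// option 2: keep pinging, but tolerate long stalls, e.g.:
	//   cfg.KeepAliveInterval = 30 * time.Second
	//   cfg.ConnectionWriteTimeout = 10 * time.Minute
	return cfg
}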

@WeiZhang555
Member

Interesting find 👍

@egernst
Member

egernst commented May 9, 2018

@WeiZhang555 @devimc -- since this is marked P1, it seems we should address it with priority. Any takers?

@jingxiaolu

This is interesting...
I checked the code: currently in agent/channel.go, if we use the serial channel, config is nil when calling yamux.Server():

session, err := yamux.Server(c.serialConn, nil)

So yamux will use the default config, which already has EnableKeepAlive set:
func DefaultConfig() *Config {
	return &Config{
		AcceptBacklog:          256,
		EnableKeepAlive:        true,
		KeepAliveInterval:      30 * time.Second,
		ConnectionWriteTimeout: 10 * time.Second,
		MaxStreamWindowSize:    initialStreamWindow,
		LogOutput:              os.Stderr,
	}
}
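
If we wanted to override that, the serial channel could pass an explicit config instead of nil. A minimal sketch adapting the yamux.Server call quoted above (disabling keepalive here is an assumption, not the merged fix):

	config := yamux.DefaultConfig()
	config.EnableKeepAlive = false // don't kill the session while the peer is frozen
	session, err := yamux.Server(c.serialConn, config)
	if err != nil {
		return nil, err
	}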

@jingxiaolu

Seems this is because we should update the vendored agent package in the shim now.

In the shim, when NewAgentClient() is called, the agentDialer() in the current vendored code does not create a yamux client (func agentDialer(addr *url.URL) dialer {):
https://github.com/kata-containers/shim/blob/2457ccc107ac46d7105254c089b240523b1701c4/vendor/github.com/kata-containers/agent/protocols/client/client.go#L115-L129

While in the current agent's protocol code, it returns a yamux client:

func agentDialer(addr *url.URL, enableYamux bool) dialer {
	var d dialer
	switch addr.Scheme {
	case vsockSocketScheme:
		d = vsockDialer
	case unixSocketScheme:
		fallthrough
	default:
		d = unixDialer
	}

	if !enableYamux {
		return d
	}

	// yamux dialer
	return func(sock string, timeout time.Duration) (net.Conn, error) {
		conn, err := d(sock, timeout)
		if err != nil {
			return nil, err
		}
		defer func() {
			if err != nil {
				conn.Close()
			}
		}()

		var session *yamux.Session
		session, err = yamux.Client(conn, nil)
		// ...
I might give it a try on Monday...

@jingxiaolu

Sorry, the yamux client here is the kata proxy, not the kata shim, so the shim may not need its vendored agent upgraded.

In my test environment, I can't reproduce the same behavior just by leaving it alone for a long time. I can reproduce it with the following steps (a self-contained sketch of the underlying mechanism follows the list):

  1. run a new container with kata-runtime;
  2. send signal 19 SIGSTOP to the proxy;
  3. wait for more than 30 seconds;
  4. send signal 18 SIGCONT to the proxy;
  5. check the journal log; the same error is printed
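
For what it's worth, the failure mode is easy to demonstrate with yamux alone: when the peer stops reading (which is effectively what SIGSTOP causes), the keepalive ping can't complete within ConnectionWriteTimeout and the session is torn down. A self-contained sketch with shortened intervals so it fails fast (plain yamux, not kata code):

package main

import (
	"fmt"
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

func main() {
	// In-memory connection pair; the client end is never read from,
	// which simulates a frozen (SIGSTOPped) peer.
	client, server := net.Pipe()
	_ = client

	cfg := yamux.DefaultConfig()
	cfg.KeepAliveInterval = time.Second      // ping quickly...
	cfg.ConnectionWriteTimeout = time.Second // ...and give up quickly

	session, err := yamux.Server(server, cfg)
	if err != nil {
		panic(err)
	}

	// After ~2s, "[ERR] yamux: keepalive failed: i/o deadline reached"
	// shows up on stderr and the session is closed.
	time.Sleep(3 * time.Second)
	fmt.Println("session closed:", session.IsClosed())
}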

@devimc
Author

devimc commented May 14, 2018

@jingxiaolu thanks for debugging it; let me try to find another way to reproduce it

@devimc
Author

devimc commented May 17, 2018

Hi, I found an easy way to reproduce it:

Terminal 1

$ docker run --name foo -ti centos bash

Terminal 2

$ docker pause foo
$ sleep 600
$ docker unpause foo

Terminal 1 gets stuck

@devimc
Author

devimc commented May 17, 2018

Actually, 45 seconds is enough to reproduce this issue.
Updating the issue description.

@jshachm
Member

jshachm commented May 17, 2018

docker pause ultimately also sends SIGSTOP to the container process (which in Kata is represented by the shim).
So we have found an actual docker CLI option to reproduce it.

Good job, both of you~~~
@jingxiaolu @devimc

@devimc devimc self-assigned this May 17, 2018
devimc pushed a commit to devimc/kata-proxy that referenced this issue May 17, 2018
We don't know how long a container can be paused, hence the connection
write timeout should be big enough that the connection is not closed
while the container is paused.

fixes kata-containers/agent#231
fixes kata-containers#70

Signed-off-by: Julio Montes <[email protected]>
@devimc
Author

devimc commented May 17, 2018

please take a look at kata-containers/proxy#71

devimc pushed a commit to devimc/kata-proxy that referenced this issue May 17, 2018
We don't know how long a container can be paused, hence the connection
write timeout should be disabled so that the connection is not closed
while the container is paused.

fixes kata-containers/agent#231
fixes kata-containers#70

Signed-off-by: Julio Montes <[email protected]>
devimc pushed a commit to devimc/kata-proxy that referenced this issue May 17, 2018
We don't know how long a container can be paused, hence the connection
write timeout should be disabled so that the connection is not closed
while the container is paused.

fixes kata-containers/agent#231
fixes kata-containers#70

Signed-off-by: Julio Montes <[email protected]>
@jshachm
Member

jshachm commented May 18, 2018

I made a mistake. Nowadays, the kata-runtime pause CLI just pauses the hypervisor (VM) using a QMP command. This is unlike what runc does, which changes the cgroup freezer for all container processes.

@devimc devimc reopened this Jun 9, 2018
@devimc
Author

devimc commented Jun 9, 2018

see kata-containers/proxy#70

devimc pushed a commit to devimc/kata-agent that referenced this issue Jun 9, 2018
Disable yamux keepalive in channel and client.
The yamux keepalive feature closes the connection between
proxy and agent when it is unable to ping them.

fixes kata-containers#231

Signed-off-by: Julio Montes <[email protected]>
devimc pushed a commit to devimc/kata-agent that referenced this issue Jun 9, 2018
Disable yamux keepalive in channel and client.
The yamux keepalive feature closes the connection between
proxy and agent when it is unable to ping them.

fixes kata-containers/proxy#70
fixes kata-containers#231

Signed-off-by: Julio Montes <[email protected]>
devimc pushed a commit to devimc/kata-agent that referenced this issue Jun 11, 2018
The yamux client runs on the proxy side; sometimes the client is handling
other requests and is not able to respond to the ping sent by the
server, so the communication is closed. To avoid I/O timeouts in the
communication between agent and proxy, keepalive should be disabled.

fixes kata-containers/proxy#70
fixes kata-containers#231

Signed-off-by: Julio Montes <[email protected]>
@bergwolf bergwolf reopened this Jun 13, 2018
devimc pushed a commit to devimc/kata-agent that referenced this issue Jun 18, 2018
The yamux client runs on the proxy side; sometimes the client is handling
other requests and is not able to respond to the ping sent by the
server, so the communication is closed. To avoid I/O timeouts in the
communication between agent and proxy, keepalive should be disabled.

fixes kata-containers/proxy#70
fixes kata-containers#231

Signed-off-by: Julio Montes <[email protected]>
devimc pushed a commit to devimc/kata-agent that referenced this issue Jun 26, 2018
The yamux client runs on the proxy side; sometimes the client is handling
other requests and is not able to respond to the ping sent by the
server, so the communication is closed. To avoid I/O timeouts in the
communication between agent and proxy, keepalive should be disabled.

fixes kata-containers/proxy#70
fixes kata-containers#231

Signed-off-by: Julio Montes <[email protected]>
sboeuf pushed a commit to devimc/kata-agent that referenced this issue Jul 3, 2018
The yamux client runs on the proxy side; sometimes the client is handling
other requests and is not able to respond to the ping sent by the
server, so the communication is closed. To avoid I/O timeouts in the
communication between agent and proxy, keepalive should be disabled.

fixes kata-containers/proxy#70
fixes kata-containers#231

Signed-off-by: Julio Montes <[email protected]>
bergwolf added a commit to bergwolf/kata-runtime that referenced this issue Jul 5, 2018
I got the following warning after upgrading the dep tool:

Warning: the following project(s) have [[constraint]] stanzas in Gopkg.toml:

  ✗  github.com/hashicorp/yamux

However, these projects are not direct dependencies of the current project:
they are not imported in any .go files, nor are they in the 'required' list in
Gopkg.toml. Dep only applies [[constraint]] rules to direct dependencies, so
these rules will have no effect.

Either import/require packages from these projects so that they become direct
dependencies, or convert each [[constraint]] to an [[override]] to enforce rules
on these projects, if they happen to be transitive dependencies.

So let's convert the constraint to an override for yamux. In the meantime,
update the yamux vendor. Full commit list:

4c2fe0d (origin/b-consul-3040) Dont output keepalive error when the session is closed
f21aae5 Make sure to drain the timer channel on defer, and a clarifying comment
601ccd8 Make receive window update logic a bit cleaner
02d320c Uses timer pool in sendNoWait, like in waitForSendErr
cf433c5 window update unit test for partial read; benchmark large buffer
ca8dfd0 improve memory utilization in receive buffer, fix flow control
683f491 Fix race around read and write deadlines in Stream (#52)
40b86b2 Add public session CloseChan method (#44)

Note that commit 4c2fe0d might also help kata-containers/agent/issues/231.

Signed-off-by: Peng Tao <[email protected]>
lifupan added a commit to lifupan/agent that referenced this issue Jul 6, 2018
We don't know how long a sandbox/container can be paused,
hence the connection write timeout should be disabled so that
the connection is not closed while the sandbox/container is paused.

The same issue has been fixed in kata-proxy; the kata built-in
proxy also needs this fix.

fixes kata-containers#231
fixes kata-containers#70

Signed-off-by: fupan <[email protected]>
sboeuf pushed a commit to devimc/kata-agent that referenced this issue Jul 18, 2018
The yamux client runs on the proxy side; sometimes the client is handling
other requests and is not able to respond to the ping sent by the
server, so the communication is closed. To avoid I/O timeouts in the
communication between agent and proxy, keepalive should be disabled.

fixes kata-containers/proxy#70
fixes kata-containers#231

Signed-off-by: Julio Montes <[email protected]>
sboeuf pushed a commit to devimc/kata-agent that referenced this issue Jul 18, 2018
The yamux client runs on the proxy side; sometimes the client is handling
other requests and is not able to respond to the ping sent by the
server, so the communication is closed. To avoid I/O timeouts in the
communication between agent and proxy, keepalive should be disabled.

Depends-on: github.com/kata-containers/proxy#91

fixes kata-containers/proxy#70
fixes kata-containers#231

Signed-off-by: Julio Montes <[email protected]>
egernst pushed a commit to kata-containers/proxy that referenced this issue Aug 23, 2018
We don't know how long a container can be paused, hence the connection
write timeout should be disabled so that the connection is not closed
while the container is paused.

fixes kata-containers/agent#231
fixes #70

Signed-off-by: Julio Montes <[email protected]>
egernst pushed a commit that referenced this issue Aug 30, 2018
The yamux client runs on the proxy side; sometimes the client is handling
other requests and is not able to respond to the ping sent by the
server, so the communication is closed. To avoid I/O timeouts in the
communication between agent and proxy, keepalive should be disabled.

Depends-on: github.com/kata-containers/proxy#91

fixes kata-containers/proxy#70
fixes #231

Signed-off-by: Julio Montes <[email protected]>
jshachm pushed a commit to jshachm/agent that referenced this issue Nov 22, 2018
The yamux client runs on the proxy side; sometimes the client is handling
other requests and is not able to respond to the ping sent by the
server, so the communication is closed. To avoid I/O timeouts in the
communication between agent and proxy, keepalive should be disabled.

Depends-on: github.com/kata-containers/proxy#91

fixes kata-containers/proxy#70
fixes kata-containers#231

Signed-off-by: Julio Montes <[email protected]>