vsock socket sporadically leaked #618
Comments
@maximilianriemensberger thanks for reporting this, I'll take a look
After debugging this issue I found that the connection created to destroy the POD is leaked. Removing the call to […]
The patch I tried (it closes all tracked connections before exit):
diff --git a/vendor/github.com/kata-containers/agent/protocols/client/client.go b/vendor/github.com/kata-containers/agent/protocols/client/client.go
index 6e386fe..2862b4c 100644
--- a/vendor/github.com/kata-containers/agent/protocols/client/client.go
+++ b/vendor/github.com/kata-containers/agent/protocols/client/client.go
@@ -8,10 +8,12 @@ package client
import (
"context"
+ "fmt"
"net"
"net/url"
"strconv"
"strings"
+ "sync"
"time"
"github.com/grpc-ecosystem/grpc-opentracing/go/otgrpc"
@@ -31,6 +33,9 @@ const (
)
var defaultDialTimeout = 15 * time.Second
+var defaultCloseTimeout = 5 * time.Second
+var vsockConnectionsLock sync.Mutex
+var vsockConnections []net.Conn
// AgentClient is an agent gRPC client connection wrapper for agentgrpc.AgentServiceClient
type AgentClient struct {
@@ -45,7 +50,26 @@ type yamuxSessionStream struct {
}
func (y *yamuxSessionStream) Close() error {
- return y.session.Close()
+ waitCh := y.session.CloseChan()
+ timeout := time.NewTimer(defaultCloseTimeout)
+
+ if err := y.Conn.Close(); err != nil {
+ return err
+ }
+
+ if err := y.session.Close(); err != nil {
+ return err
+ }
+
+ // block until session is really closed
+ select {
+ case <-waitCh:
+ timeout.Stop()
+ case <-timeout.C:
+ return fmt.Errorf("timeout waiting for session close")
+ }
+
+ return nil
}
type dialer func(string, time.Duration) (net.Conn, error)
@@ -196,6 +220,10 @@ func agentDialer(addr *url.URL, enableYamux bool) dialer {
}
func unixDialer(sock string, timeout time.Duration) (net.Conn, error) {
+ if strings.HasPrefix(sock, "unix:") {
+ sock = strings.Trim(sock, "unix:")
+ }
+
dialFunc := func() (net.Conn, error) {
return net.DialTimeout("unix", sock, timeout)
}
@@ -264,11 +292,16 @@ func commonDialer(timeout time.Duration, dialFunc func() (net.Conn, error), time
if !ok {
return nil, timeoutErrMsg
}
+ case <-t.C:
+ cancel <- true
+ return nil, timeoutErrMsg
}
return conn, nil
}
+var panicIfDial = false
+
func vsockDialer(sock string, timeout time.Duration) (net.Conn, error) {
cid, port, err := parseGrpcVsockAddr(sock)
if err != nil {
@@ -276,10 +309,33 @@ func vsockDialer(sock string, timeout time.Duration) (net.Conn, error) {
}
dialFunc := func() (net.Conn, error) {
- return vsock.Dial(cid, port)
+ conn, err := vsock.Dial(cid, port)
+ if err != nil {
+ return nil, err
+ }
+
+ vsockConnectionsLock.Lock()
+ if panicIfDial {
+ panic("panicIfDial")
+ }
+ vsockConnections = append(vsockConnections, conn)
+ vsockConnectionsLock.Unlock()
+ return conn, nil
}
timeoutErr := grpcStatus.Errorf(codes.DeadlineExceeded, "timed out connecting to vsock %d:%d", cid, port)
return commonDialer(timeout, dialFunc, timeoutErr)
}
+
+// CloseVSockConnections closes all vsock connections recorded by the dialer
+func CloseVSockConnections() error {
+ vsockConnectionsLock.Lock()
+ for _, c := range vsockConnections {
+ c.Close()
+ }
+ vsockConnections = []net.Conn{}
+ panicIfDial = true
+ vsockConnectionsLock.Unlock()
+ return nil
+}
If someone else has any idea about what's going on, please take a look.
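For illustration, here is a minimal sketch of how the CloseVSockConnections helper introduced by the patch above could be called from a runtime teardown path. The cleanupSandbox function and its call site are assumptions made for this example, not actual kata-runtime code:

```go
package main

import (
	"log"

	"github.com/kata-containers/agent/protocols/client"
)

// cleanupSandbox is a hypothetical shutdown path: after the sandbox/VM has
// been torn down, close any vsock connections the patched dialer recorded so
// none are left dangling when the runtime process exits.
func cleanupSandbox() {
	// ... stop containers, destroy the sandbox, shut down the VM ...

	// Best-effort cleanup of the connections tracked in vsockConnections.
	if err := client.CloseVSockConnections(); err != nil {
		log.Printf("could not close leftover vsock connections: %v", err)
	}
}

func main() {
	cleanupSandbox()
}
```

As the later comments show, closing the connections from the runtime side was not enough, which fits the eventual diagnosis that the leak is a race in the host's vhost_vsock.ko module.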
cc @kata-containers/runtime @bergwolf @sboeuf @jcvenegas
Could you check if there are any dangling processes?
@bergwolf there are no processes running. I tried to close all connections before exit using the above patch, but that didn't work.
@devimc Does […]
Well, the good news (I guess) is that I can reproduce this fairly easily using the above script example (thanks @maximilianriemensberger). I will keep digging for the rest of the day. One oddity to note is that I never (so far) see more than 3 or 4 vsocks being listed by ss.
I had another peek. No real new info, but I'll drop some details of what I see here. The script I ran, based on the original above, is:
#!/bin/bash
set -x
set -e
sudo modprobe -i vhost_vsock
ps -eo pid,user,args > ps_pre.txt
lsmod > lsmod_pre.txt
sudo lsof > lsof_pre.txt
for ((i=1; i<=100; i++)); do
echo "# Run $(printf "%3d\n" $i)"
docker run --runtime kata-runtime -it --rm ubuntu bash -c 'true'
sleep 2
lsmod | grep ^vhost_vsock
ss -ip --vsock
done
ps -eo pid,user,args > ps_post.txt
lsmod > lsmod_post.txt
sudo lsof > lsof_post.txt
Running that (on a Fedora machine with kernel 4.17.3-100.fc27.x86_64), I almost always see 1 to 3 vsocks reported by ss --vsock. You can see we have the […]
Trying to work this out a bit more, I had a dig around in […]
Nope.
Hi @grahamwhaley, I have finally managed to reproduce this and will let you know what I find.
Excellent, thanks @stefanha. And, welcome back :-)
I can confirm that sockets are leaking. Using packet capture (via the vsockmon module) I can see that it happens for sockets that aren't shut down by the guest before it terminates. This is a race condition. I'll try to find a minimal reproducer. A vhost_vsock.ko fix is probably necessary.
Nice - well, not nice, if you see what I mean, but thanks @stefanha for hunting that down!
The vhost_vsock.ko fix is available here: […]
Thanks @stefanha - yeah, interesting fix :-)
Oh, and I guess I should ask really: is that a host-side fix, in-VM, or even both, @stefanha?
@grahamwhaley It only affects the vhost_vsock.ko host-side kernel module. I suggest keeping an eye on the patch (I've CCed you) until Michael Tsirkin (vhost maintainer in Linux) has merged it into his tree. At that point it's a good bet to ship it while waiting for it to land in stable.
@stefanha - any updates on the patch? Can you share a pointer?
The fix went into Linux 4.20: commit c38f57da428b033f2721b611d84b1f40bde674a8 ("vhost/vsock: fix reset orphans race with close timeout", Stefan Hajnoczi, 2018-12-06).
Awesome! Will it be backported to the 4.19 tree?
Yes, Greg KH has taken it into the 4.19-stable tree.
Available since 4.19.12.
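Since the fixed vhost_vsock.ko only ships in host kernels 4.19.12 and newer (or 4.20+ mainline), a deployment can gate vsock usage on the running kernel release. Below is a minimal, hypothetical Go sketch of such a check; the kernelRelease helper and the way it handles version suffixes are assumptions for this example, not the actual kata-runtime logic:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// kernelRelease parses the running kernel version from
// /proc/sys/kernel/osrelease (e.g. "4.19.23-200.fc29.x86_64").
func kernelRelease() (major, minor, patch int, err error) {
	raw, err := os.ReadFile("/proc/sys/kernel/osrelease")
	if err != nil {
		return 0, 0, 0, err
	}
	// Keep only the numeric "X.Y.Z" prefix before any "-suffix".
	numeric := strings.SplitN(strings.TrimSpace(string(raw)), "-", 2)[0]
	fields := strings.Split(numeric, ".")
	var nums [3]int
	for i := 0; i < len(fields) && i < 3; i++ {
		if nums[i], err = strconv.Atoi(fields[i]); err != nil {
			return 0, 0, 0, fmt.Errorf("unexpected kernel release %q: %v", numeric, err)
		}
	}
	return nums[0], nums[1], nums[2], nil
}

func main() {
	major, minor, patch, err := kernelRelease()
	if err != nil {
		panic(err)
	}
	// The vhost_vsock orphan-reset fix is in 4.19.12+ stable and 4.20+ mainline.
	hasFix := major > 4 || (major == 4 && (minor > 19 || (minor == 19 && patch >= 12)))
	fmt.Printf("kernel %d.%d.%d, vhost_vsock close-race fix present: %v\n", major, minor, patch, hasFix)
}
```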
We need to bump the kernel version from 4.14.67 to 4.19.23 in order to follow the recent kernel config bump. Fixes kata-containers#618 Fixes kata-containers#1029 Signed-off-by: Sebastien Boeuf <[email protected]>
We need to bump the kernel version from 4.14.67 to 4.19.24 in order to follow the recent kernel config bump. Fixes kata-containers#618 Fixes kata-containers#1029 Signed-off-by: Sebastien Boeuf <[email protected]>
vendor: dep check fixes
Description of problem
vsock sockets are leaked. I'm not entirely sure whether that's a kata issue or a kernel issue. Probably kernel, but you are more familiar with the innards of vsock and KVM than I am. My config file has use_vsock=true. Other than that it's all standard.
Meta details
Running kata-collect-data.sh version 1.2.0 (commit 0bcb32f) at 2018-08-21.15:00:29.959157415+0200.
(With some fixups to choose the correct config file)
Runtime is /usr/bin/kata-runtime.
kata-env
Output of "/usr/bin/kata-runtime --kata-config /etc/kata-containers/configuration-vsock.toml kata-env":
Runtime config files
Runtime config file contents
Output of "
cat "/etc/kata-containers/configuration-vsock.toml"
":Image details
Initrd details
No initrd
Logfiles
Runtime logs
Recent runtime problems found in system journal:
Proxy logs
Recent proxy problems found in system journal:
Shim logs
Recent shim problems found in system journal:
Container manager details
Have docker
Docker
Output of "
docker version
":Output of "
docker info
":Output of "
systemctl show docker
":No
kubectl
Packages
Have dpkg
Output of "dpkg -l|egrep "(cc-oci-runtimecc-runtimerunv|kata-proxy|kata-runtime|kata-shim|kata-containers-image|linux-container|qemu-)"":
No rpm