
Requests from within the cluster timeout #2208

Closed
asankov-cb opened this issue Apr 20, 2021 · 5 comments
Labels: kind/external (upstream bugs), kind/support (Categorizes issue or PR as a support question.)

Comments

@asankov-cb

What happened:
I am running a kind cluster. The cluster has 2 services which periodically communicate with a backend over both HTTP and gRPC. However, all of the outbound requests fail with a timeout.

What you expected to happen:
Requests are successful.

How to reproduce it (as minimally and precisely as possible):
To do this you would need access to an enterprise product, which is not yet GA. I am opening this issue more as a question, looking for advice about what I need to look for and whether someone else has had a similar problem.

While googling I came across #717. I have had issues with resource limits before, so they are already set pretty high. Anyhow, I bumped them up once again (CPUs: 10, Memory: 12 GB, Swap: 1.5 GB, Disk image size: 80 GB / 61 GB used), and I can still reproduce the issue.

I am opening this issue with kind because we are not able to reproduce the problem in any other environment (EKS, AKS, etc.).
Anything else we need to know?:
Here I have some log messages. The first one is from a failing HTTP request to the backend:

time="2021-04-20T10:14:44Z" level=warning msg="failed to update cluster status: Get \"<URL_OBFUSCATED>": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

and the other one is from a failing gRPC request:

time="2021-04-20T10:14:56Z" level=error msg="failed to send resource updated message: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: EOF\""

Environment:

  • kind version: (use kind version):
    • kind version 0.11.0-alpha
    • but also with kind version 0.10.0
  • Kubernetes version: (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-14T05:14:17Z", GoVersion:"go1.15.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-21T01:11:42Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
  • Docker version: (use docker info):
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)
  scan: Docker Scan (Docker Inc., v0.6.0)

Server:
 Containers: 92
  Running: 57
  Paused: 0
  Stopped: 35
 Images: 144
 Server Version: 20.10.5
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.10.25-linuxkit
 Operating System: Docker Desktop
 OSType: linux
 Architecture: x86_64
 CPUs: 10
 Total Memory: 11.7GiB
 Name: docker-desktop
 ID: NW2M:Y7PR:B26O:BX7O:2PR4:7RM2:BISS:QURF:B3MB:LJIK:7AC5:A3MN
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy: http.docker.internal:3128
 HTTPS Proxy: http.docker.internal:3128
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
  • OS (e.g. from /etc/os-release):
asankov-cb added the kind/bug label on Apr 20, 2021
BenTheElder added the kind/support label and removed the kind/bug label on Apr 20, 2021
@BenTheElder
Member

To do this you would need access to an enterprise product, which is still not-GA. I am opening this issue more as a question and looking for advice about what I need to look for, and whether someone else has had a similar problem.

There's really not a lot we can do here to debug your external service timing out. There are many factors ...

kind version 0.11.0-alpha

I don't recommend using pre-release versions of kind unless you're developing Kubernetes itself.

If you do, however, it's helpful to use a version built with the Makefile, so we get the git information, which is not possible with go get etc.
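
For illustration, something along these lines (a sketch; the exact Makefile targets may differ between kind versions):

# Build from a git checkout so the binary carries the commit metadata:
git clone https://github.com/kubernetes-sigs/kind
cd kind
make build          # should produce ./bin/kind
./bin/kind version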

@asankov-cb
Author

There's really not a lot we can do here to debug your external service timing out. There are many factors ...

I understand that. As I said, I was just wondering whether this kind of error rings a bell for someone. If not, I guess we can close this, and I will go back to debugging the networking of my local installation.

I don't recommend using pre-release versions of kind unless you're developing Kubernetes itself.

I see, but I also tried with kind 0.10.0 and got the same errors.

@aojea
Contributor

aojea commented Apr 21, 2021

It is a connectivity problem, so you have to find where the connectivity is broken:
  • Are you able to connect to the external service from the VM where kind is running?
  • If yes, are you able to reach the service from one of the kind nodes (containers)?
  • If yes, run a traceroute from the pod and check whether the external service receives the request but the reply never gets back to the pod, ...
Something along the lines of the commands sketched below.
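
A rough sketch of those checks, assuming Docker Desktop (so the "VM" is the Docker Desktop VM), the default node container name, and a placeholder backend hostname:

# 1. From the VM where kind runs (with Docker Desktop, a host-network container is the closest equivalent):
docker run --rm --network host curlimages/curl -v --max-time 10 https://backend.example.com/

# 2. From a kind node container ("kind-control-plane" is the default name; curl may need installing in the node image):
docker exec -it kind-control-plane curl -v --max-time 10 https://backend.example.com/

# 3. From a pod, trace the route and see whether replies make it back:
kubectl run net-debug --rm -it --restart=Never --image=busybox -- \
  traceroute backend.example.com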

@asankov-cb
Author

This was due to a bug in Docker Desktop. They introduced some kind of proxy which did not use the local cert store and caused a lot of issues, this being just one of them. It was fixed relatively quickly. I am currently on Docker Desktop 3.3.3 and the problem no longer reproduces.
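
In hindsight, the docker info output above already hinted at it (HTTP Proxy / HTTPS Proxy: http.docker.internal:3128). For anyone checking for something similar, the proxy Docker Desktop injects can be read directly (a quick sketch; if a handshake that works on the host fails from behind this path, the proxy is a likely suspect):

# Show any proxy configured in the Docker engine's environment:
docker info --format 'HTTP proxy: {{.HTTPProxy}} / HTTPS proxy: {{.HTTPSProxy}}'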

@BenTheElder
Member

Phew! Thanks for following up!
