
Upload file/dir to pod does not work in GKE #5599

Closed
lior444 opened this issue Nov 20, 2023 · 19 comments

lior444 commented Nov 20, 2023

Describe the bug

Ever since our clusters were upgraded to a GKE version that works only with containerd, the upload method simply does not work. It does work on my local minikube, but not on GKE (1.25.10-gke.2700).

Trying to debug this, I can see that the upload process starts: a fragment of the file appears in /tmp for less than a second, but it is removed immediately and the upload method returns false.
I tried playing with kubernetes.upload.connection.timeout & kubernetes.upload.request.timeout, but it doesn't help.

It is important to mention that my source is actually a different pod in the same cluster, from which I initiate the kubernetes-client. The tests are executed from that pod, and the API calls that deploy the remote pod and try to upload the file to it are run from there as well.
A simple kubectl cp command in the cluster does work.

Any ideas?

Fabric8 Kubernetes Client version

6.9.2

Steps to reproduce

  1. Deploy a Maven pod on a GKE cluster from which you can execute the kubernetes-client API and tests.
  2. Try to deploy a different pod and upload a file to it using kubernetes-client code (see the sketch below).
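For reference, a minimal sketch of what such an upload call looks like with the fabric8 kubernetes-client; the namespace, pod name, and paths below are placeholders, not the reporter's actual values:

import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import java.nio.file.Paths;

public class UploadToPodSketch {
    public static void main(String[] args) {
        // The kubernetes.upload.connection.timeout / kubernetes.upload.request.timeout
        // system properties mentioned above can be set before building the client.
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Upload a local directory into the target pod; upload() returns false
            // when the client considers the transfer to have failed.
            boolean uploaded = client.pods()
                    .inNamespace("my-namespace")      // placeholder namespace
                    .withName("target-pod")           // placeholder pod name
                    .dir("/tmp/upload-target")        // destination directory inside the pod
                    .upload(Paths.get("/local/dir")); // placeholder local source
            System.out.println("upload succeeded: " + uploaded);
        }
    }
}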

Expected behavior

The file/dir should be copied to the pod and the upload method should return true; instead, the upload method returns false and the file/dir is not copied.

Runtime

other (please specify in additional context)

Kubernetes API Server version

1.25.3@latest

Environment

GKE

Fabric8 Kubernetes Client Logs

No logs from fabric8 are seen...

Additional context

No response

Member

manusa commented Nov 20, 2023

Might be the same root cause as #5527.

In any case, the same fix should likely apply.

@cr-liorholtzman

@manusa hope so...
Though I don't see the "Read-only file system" error in my case...

Member

manusa commented Nov 20, 2023

Well, I don't mean to say that the problem is a read-only file system or a /tmp directory with no write permissions, but rather a /tmp directory that is not usable for whatever reason.

If in your case the target directory is writable, the same approach we can use to fix #5527 should also be valid for this.

@cr-liorholtzman

@manusa I understand. To be honest, I'm not 100% sure that's the case, because if it were, would kubectl cp work?
I also tried kubectl cp directly to /tmp and it does work.

Member

manusa commented Nov 20, 2023

I also tried kubectl cp directly to tmp and it does work.

OK, this is weird then. Maybe @shawkins has some ideas about why this might be failing in our case and not for kubectl.

@shawkins
Contributor

@cr-liorholtzman @manusa It may take more debugging to see why it's returning false - with debug logging enabled you should see the reason false is returned.

To work around a Kubernetes bug with websockets, we added logic that first copies the file to the tmp location and then validates its size. If for some reason your container is reporting an unexpected size, we consider the operation to have failed.


cr-liorholtzman commented Nov 20, 2023

@shawkins do you have your own debug property or just the regular maven/spring?
Also, it happens on several container types, not just one, I'm afraid.

@shawkins
Contributor

The client uses slf4j, so you can configure it with whatever is bound as the logging implementation.

The relevant package / logging context here is io.fabric8.kubernetes.client.dsl.internal.uploadable - but you can go much higher up if you want to see more of what is happening.
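For example, here is a minimal sketch of turning on that debug logging, assuming slf4j-simple happens to be the bound implementation (with Logback or Log4j you would set the equivalent logger level in their configuration instead; the class name is just for illustration):

public class EnableUploadDebugLogging {
    public static void main(String[] args) {
        // slf4j-simple reads logger levels from system properties named
        // org.slf4j.simpleLogger.log.<logger-name>; set this before the first
        // logger is created, or pass it as a -D JVM argument instead.
        System.setProperty(
                "org.slf4j.simpleLogger.log.io.fabric8.kubernetes.client.dsl.internal.uploadable",
                "debug");
        // ... then run the upload; the client will log why it returned false.
    }
}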


cr-liorholtzman commented Nov 21, 2023

@shawkins So I executed the test with debug enabled and the logs are below. From what I can understand, either the webSocket is closed prematurely, or fabric8 fails to transfer the file from /tmp to the destination folder (I'm not sure whether the validation happens before or after the file is transferred inside the pod).

12:38:23 2023-11-21 10:38:23:542 [fullIntegrationTest-3] INFO c.c.d.s.SpringBootBaseContainerService.waitTillUp:53 - Container XXXX is UP
12:38:23 2023-11-21 10:38:23:543 [fullIntegrationTest-3] INFO c.c.d.a.i.IntegrationTests.fullIntegrationTest:134 - XXXX is up
12:38:23 2023-11-21 10:38:23:543 [fullIntegrationTest-3] INFO c.c.d.a.u.ContainerHelper.uploadFolderToPod:152 - Uploading dir from : /XXXXX/20220227
12:38:23 2023-11-21 10:38:23:603 [fullIntegrationTest-3] DEBUG i.f.k.c.d.i.c.v.PodOperationsImpl.validateOrDefaultContainerId:350 - using first container XXXX in pod XXXX
12:38:24 2023-11-21 10:38:23:701 [OkHttp https://XXXX/...] DEBUG i.f.k.c.d.i.ExecWebSocketListener.handle:120 - exec message received on channel errorChannel: {"metadata":{},"status":"Success"}
12:38:24 2023-11-21 10:38:23:702 [OkHttp https://XXXX/...] DEBUG i.f.k.c.d.i.ExecWebSocketListener.onClose:373 - Exec Web Socket: On Close with code:[1000], due to: []
12:38:24 2023-11-21 10:38:23:740 [OkHttp Dispatcher] DEBUG i.f.k.c.d.i.AbstractWatchManager.startWatch:314 - Watching https://XXXXX/api/v1/namespaces/XXXX?allowWatchBookmarks=true&fieldSelector=metadata.name%3DcXXXXXX&resourceVersion=1245545873&timeoutSeconds=600&watch=true...
12:38:24 2023-11-21 10:38:23:741 [fullIntegrationTest-3] DEBUG i.f.k.c.d.i.c.v.PodOperationsImpl.validateOrDefaultContainerId:350 - using first container XXXXX
12:38:24 2023-11-21 10:38:23:802 [OkHttp https://XXXX/...] DEBUG i.f.k.c.d.i.WatcherWebSocketListener.onOpen:40 - WebSocket successfully opened
12:38:24 2023-11-21 10:38:23:803 [OkHttp https://XXXX/...] DEBUG i.f.k.c.d.i.AbstractWatchManager.close:323 - Force closing the watch
12:38:24 2023-11-21 10:38:23:803 [OkHttp https://XXXX/...] DEBUG i.f.k.c.d.i.AbstractWatchManager.closeRequest:160 - Closing the current watch
12:38:24 2023-11-21 10:38:23:804 [OkHttp https://XXXX/...] DEBUG i.f.k.c.d.i.AbstractWatchManager.close:323 - Force closing the watch
12:38:24 2023-11-21 10:38:23:805 [OkHttp https://XXXX/...] DEBUG i.f.k.c.d.i.AbstractWatchManager.closeEvent:199 - Ignoring duplicate firing of onClose event
12:38:24 2023-11-21 10:38:23:806 [OkHttp https://XXXX/...] DEBUG i.f.k.c.d.i.WatcherWebSocketListener.onClose:65 - WebSocket close received. code: 1000, reason:
12:38:24 2023-11-21 10:38:23:806 [OkHttp https://XXXX/...] DEBUG i.f.k.c.d.i.AbstractWatchManager.scheduleReconnect:219 - Ignoring already closed/closing connection
12:38:24 2023-11-21 10:38:23:863 [OkHttp https://XXXX/...] DEBUG i.f.k.c.d.i.ExecWebSocketListener.handle:111 - exec message received 2 bytes on channel stdOut
12:38:24 2023-11-21 10:38:23:900 [OkHttp https://XXXX/...] DEBUG i.f.k.c.d.i.ExecWebSocketListener.handle:120 - exec message received on channel errorChannel: {"metadata":{},"status":"Success"}
12:38:24 2023-11-21 10:38:23:903 [OkHttp https://XXXXX/...] DEBUG i.f.k.c.d.i.ExecWebSocketListener.onClose:373 - Exec Web Socket: On Close with code:[1000], due to: []
12:38:24 2023-11-21 10:38:23:903 [fullIntegrationTest-3] DEBUG i.f.k.c.d.i.u.PodUpload.upload:123 - upload file size validation failed, expected 1869312, but was 0

@shawkins
Contributor

@cr-liorholtzman the problem is with the size validation:

DEBUG i.f.k.c.d.i.u.PodUpload.upload:123 - upload file size validation failed, expected 1869312, but was 0

It's running "wc -c < tmpfile" to get the file size - could you try that directly on your pod and see what it produces?

kubernetes/kubernetes#89899 is the upstream issue that prevents websocket-based file uploads from working reliably. The underlying fix was merged not that long ago, so eventually this workaround on our side won't be needed.
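If it helps, here is a minimal sketch of running that same shell check through the client's exec API, so the value comes back over the same websocket path the upload uses; namespace, pod name, and file path are placeholders:

import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.dsl.ExecWatch;
import java.io.ByteArrayOutputStream;
import java.util.concurrent.TimeUnit;

public class WcSizeCheckSketch {
    public static void main(String[] args) throws Exception {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            // Run "wc -c < file" inside the pod and capture stdout.
            try (ExecWatch watch = client.pods()
                    .inNamespace("my-namespace")       // placeholder namespace
                    .withName("target-pod")            // placeholder pod name
                    .writingOutput(out)
                    .exec("sh", "-c", "wc -c < /tmp/some-existing-file")) {
                watch.exitCode().get(30, TimeUnit.SECONDS);
            }
            System.out.println("wc -c reported: " + out.toString().trim());
        }
    }
}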


cr-liorholtzman commented Nov 21, 2023

@shawkins You mean simply running "wc -c < tmpfile" in the pod? The tmpfile is removed shortly after the upload starts; I can see it for just a fraction of a second...

@shawkins
Contributor

@shawkins You mean simply running "wc -c < tmpfile" in the pod? The tmpfile is removed shortly after the upload starts; I can see it for just a fraction of a second...

I mean on any file, to start, just to validate that the appropriate size is seen.


cr-liorholtzman commented Nov 21, 2023

@shawkins tested now, wc -c works well...
The upload also works locally when I test it with minikube; it only fails on GKE.

@shawkins
Contributor

tested now, wc -c works well...

Meaning that on both GKE and minikube you get an appropriate value back for any given file? If that's the case, it would imply that we're not transferring anything successfully to GKE in the first place.


cr-liorholtzman commented Nov 21, 2023

@shawkins Yup, the command works on both environments... the upload fails only in GKE.

@cr-liorholtzman

@shawkins answering here to keep track of the issue: what other logs can I extract? The above logs are after enabling "trace" on f8 at the .internal level. Is there anything else I can enable?

@shawkins
Contributor

@shawkins answering here to keep track of the issue: what other logs can I extract? The above logs are after enabling "trace" on f8 at the .internal level. Is there anything else I can enable?

If this is all we have to work with, this is what I see:

2023-11-21 10:38:23:543 [fullIntegrationTest-3] INFO  c.c.d.a.u.ContainerHelper.uploadFolderToPod:152 - Uploading dir from : /XXXXX/20220227
2023-11-21 10:38:23:603 [fullIntegrationTest-3] DEBUG i.f.k.c.d.i.c.v.PodOperationsImpl.validateOrDefaultContainerId:350 - using first container XXXX in pod XXXX

The client needs to determine what container to use since it's not set on your upload operation. Based upon this log ordering it appears that the pod is already ready, so it shouldn't need a watch - it can simply perform a list and glean the container id from there.

2023-11-21 10:38:23:701 [OkHttp https://XXXX/...] DEBUG i.f.k.c.d.i.ExecWebSocketListener.handle:120 - exec message received on channel errorChannel: {"metadata":{},"status":"Success"}
2023-11-21 10:38:23:702 [OkHttp https://XXXX/...] DEBUG i.f.k.c.d.i.ExecWebSocketListener.onClose:373 - Exec Web Socket: On Close with code:[1000], due to: []

There are normally 3 websocket invocations expected for an upload: the upload, the size check, and the exploding of the tar - these are currently three different execs/websockets. You expect to see {"metadata":{},"status":"Success"} when a command completes successfully - that is not a message for a watch websocket.

Since these ExecWebSocketListener events appear first, they should be the result of the upload - it's atypical to see success here because it's the termination of the websocket that completes the upload (the websocket protocol currently lacks a way to send an end of input). It simply looks like the pod / api-server is signaling it is done with the upload without having received anything.

2023-11-21 10:38:23:740 [OkHttp Dispatcher] DEBUG i.f.k.c.d.i.AbstractWatchManager.startWatch:314 - Watching https://XXXXX/api/v1/namespaces/XXXX?allowWatchBookmarks=true&fieldSelector=metadata.name%3DcXXXXXX&resourceVersion=1245545873&timeoutSeconds=600&watch=true...

In your code, in ContainerHelper or above, are you starting a Watch, Informer, etc.? All of the subsequent AbstractWatchManager logs relate to that Watch.

2023-11-21 10:38:23:863 [OkHttp https://XXXX/...] DEBUG i.f.k.c.d.i.ExecWebSocketListener.handle:111 - exec message received 2 bytes on channel stdOut
2023-11-21 10:38:23:900 [OkHttp https://XXXX/...] DEBUG i.f.k.c.d.i.ExecWebSocketListener.handle:120 - exec message received on channel errorChannel: {"metadata":{},"status":"Success"}
2023-11-21 10:38:23:903 [OkHttp https://XXXXX/...] DEBUG i.f.k.c.d.i.ExecWebSocketListener.onClose:373 - Exec Web Socket: On Close with code:[1000], due to: []

This is the client performing the size check and getting back 0.
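For context, the kind of code that produces those AbstractWatchManager log lines is an explicit single-pod watch like the minimal sketch below (names are placeholders); readiness helpers such as waitUntilReady can also open a similar watch under the covers:

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.Watch;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

public class PodWatchSketch {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build();
             // Watching a single named pod results in the fieldSelector=metadata.name
             // watch URL seen in the logs above.
             Watch watch = client.pods()
                     .inNamespace("my-namespace")   // placeholder namespace
                     .withName("target-pod")        // placeholder pod name
                     .watch(new Watcher<Pod>() {
                         @Override
                         public void eventReceived(Action action, Pod pod) {
                             System.out.println(action + " " + pod.getMetadata().getName());
                         }

                         @Override
                         public void onClose(WatcherException cause) {
                             System.out.println("watch closed: " + cause);
                         }
                     })) {
            // ... wait for whatever condition the test needs; closing the watch
            // produces the "Force closing the watch" / onClose log lines.
        }
    }
}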


stale bot commented Feb 26, 2024

This issue has been automatically marked as stale because it has not had any activity since 90 days. It will be closed if no further activity occurs within 7 days. Thank you for your contributions!


stale bot commented May 27, 2024

This issue has been automatically marked as stale because it has not had any activity since 90 days. It will be closed if no further activity occurs within 7 days. Thank you for your contributions!
