
Issues with upload / failures with PodIT #4910

Closed
shawkins opened this issue Feb 23, 2023 · 3 comments · Fixed by #4968
@shawkins
Contributor

With Kubernetes 1.26 we are still seeing errors with pod upload:

Error:  Failures: 
Error:    PodIT.uploadBinaryFile:318 array lengths differ, expected: <16385> but was: <8192>
Error:    PodIT.uploadFile:298 expected: <I'm uploaded> but was: <>

https://github.com/fabric8io/kubernetes-client/actions/runs/4252695615/jobs/7396899761#step:6:579

With additional trace logging, failed test runs showed the same data being sent to the server either way, as measured at the level where OkHttp writes to the SSL socket.

There doesn't seem to be a good explanation for this, other than that we are more prone to seeing it on the containerd minikube runtime.

The issue does seem to match one reported upstream: kubernetes/kubernetes#112834

We may want to add a formal warning about this behavior so that users are aware of potential data loss.

@andreaTP
Member

andreaTP commented Mar 2, 2023

I keep hitting this issue on an unrelated PR:
https://github.com/fabric8io/kubernetes-client/actions/runs/4313568880/jobs/7526046821#step:6:500

Should we consider skipping those tests until we have a good solution?

@shawkins
Contributor Author

shawkins commented Mar 2, 2023

The upload issue is prevalent against 1.26 / containerd, with all client types. It typically looks like an entire packet is being lost, but now here's one where just a single byte is off: https://github.com/fabric8io/kubernetes-client/actions/runs/4310764866/jobs/7519523723#step:5:1160

If we disable those tests we should definitely add a warning in the release notes that pod upload seems broken. The guess is that the API server is introducing message ordering issues in each direction, sending or receiving, which would also cause the download issue.

The download issue seems to be much rarer; I tried reproducing it more yesterday without success. I'd probably leave that test running for now.

@shawkins
Contributor Author

shawkins commented Mar 6, 2023

The copyDir failures with the e2e tests and vertx are in the same vein as the download issue, but seem more reproducible.

I modified the code to fully read the input stream before passing it off to the tar utility, so it prints both the length and an MD5 sum of the contents. Each successful run shows:
12800 Hwu1HVU3N2gLBk1b8cUA+g==

An unsuccessful run looks like:

[main] DEBUG io.fabric8.kubernetes.client.dsl.internal.core.v1.PodOperationsImpl - using first container busybox in pod pod-standard
[vert.x-eventloop-thread-1] DEBUG io.fabric8.kubernetes.client.dsl.internal.ExecWebSocketListener - exec message received 512 bytes on channel stdOut
[vert.x-eventloop-thread-1] DEBUG io.fabric8.kubernetes.client.dsl.internal.ExecWebSocketListener - exec message received on channel stdErr: tar: removing leading '/' from member names

[vert.x-eventloop-thread-1] DEBUG io.fabric8.kubernetes.client.dsl.internal.ExecWebSocketListener - exec message received 4096 bytes on channel stdOut
[vert.x-eventloop-thread-1] DEBUG io.fabric8.kubernetes.client.dsl.internal.ExecWebSocketListener - Exec Web Socket: On Close with code:[1000], due to: [null]
[-1206678562-pool-1-thread-2] DEBUG io.fabric8.kubernetes.client.dsl.internal.ExecWebSocketListener - Exec Web Socket: completed with a null exit code - no status was received prior to onClose
4608 U/9K/Wr7WsvMAIEcua8Hqw==
[vert.x-eventloop-thread-1] DEBUG io.fabric8.kubernetes.client.dsl.internal.ExecWebSocketListener - exec message received 8192 bytes on channel stdOut

Note that the length / checksum output appears before the message containing the remaining bytes, all of which arrives after receiving close from the server.
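
For reference, the instrumentation is roughly along these lines (a minimal sketch; the class and method names are illustrative): drain the stream, print the length and a Base64 MD5, and hand an equivalent stream back to the tar handling.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

final class StreamDump {

  // fully drain the stream, print "<length> <base64 md5>", and hand back an
  // equivalent stream so the downstream tar handling is unchanged
  static InputStream dumpLengthAndMd5(InputStream in) throws IOException {
    byte[] data = in.readAllBytes();
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      System.out.println(data.length + " " + Base64.getEncoder().encodeToString(md5.digest(data)));
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
    return new ByteArrayInputStream(data);
  }
}
```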

The workaround for the download handling is that we should be able to expect an errorChannel / exit code message. Rather than immediately terminating on onClose, we can wait some amount of time for that message to appear. Local testing seemed to confirm that this works.
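
To make the idea concrete, here is a minimal sketch of that close handling (not the actual change in the fix PR; the class, field, and method names are illustrative): on onClose, give the exit code frame a short grace period before finishing the streams.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class GracefulExecClose {

  // completed by the message handler when an errorChannel / exit code frame is seen
  private final CompletableFuture<Integer> exitCode = new CompletableFuture<>();

  void onExitCode(int code) {
    exitCode.complete(code);
  }

  void onClose() {
    try {
      // instead of finishing immediately, give the exit code frame a short
      // grace period to arrive - it normally follows the last stdOut frame
      exitCode.get(5, TimeUnit.SECONDS);
    } catch (TimeoutException e) {
      // no status received within the grace period; finish anyway
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } catch (ExecutionException e) {
      // the exit code future failed; finish anyway
    }
    finishStreams();
  }

  void finishStreams() {
    // placeholder for the normal close handling: flush and close whatever
    // is consuming stdOut / stdErr
  }
}
```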

The upload is thornier. We really don't have any good way to know whether the server has received our data. There is no expected exit code message, and even if one did arrive, messages misordered by the API server would at best show up as an error if we used tar. The best I can come up with is that we could compute / request a checksum afterwards and clearly error and/or retry some number of times until there is a match.
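
A rough sketch of that checksum-and-retry idea (not necessarily what a fix would actually do), assuming the client's PodResource file(...).upload(...) and exec DSL plus ExecWatch#exitCode() from recent client versions; the helper name and retry policy are illustrative:

```java
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.dsl.ExecWatch;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

final class VerifiedUpload {

  // upload a local file, then run md5sum in the container and compare it with
  // a locally computed digest, retrying a bounded number of times
  static boolean uploadWithVerification(KubernetesClient client, String ns, String pod,
      Path local, String remote, int maxAttempts) throws Exception {
    StringBuilder sb = new StringBuilder();
    for (byte b : MessageDigest.getInstance("MD5").digest(Files.readAllBytes(local))) {
      sb.append(String.format("%02x", b));
    }
    String expected = sb.toString();

    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      client.pods().inNamespace(ns).withName(pod).file(remote).upload(local);
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      try (ExecWatch watch = client.pods().inNamespace(ns).withName(pod)
          .writingOutput(out)
          .exec("md5sum", remote)) {
        watch.exitCode().join(); // wait for md5sum to complete
      }
      // md5sum prints "<digest>  <file>"
      String actual = new String(out.toByteArray(), StandardCharsets.UTF_8).trim().split("\\s+")[0];
      if (expected.equals(actual)) {
        return true;
      }
    }
    return false; // clearly signal that the upload could not be verified
  }
}
```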

Added a comment to the upstream issue and linked to #4923 for good measure.

shawkins added commits to shawkins/kubernetes-client that referenced this issue Mar 9, 2023
manusa pushed a commit to shawkins/kubernetes-client that referenced this issue Mar 10, 2023
@manusa manusa closed this as completed in 11daf25 Mar 10, 2023
@manusa manusa reopened this Mar 10, 2023
shawkins added commits to shawkins/kubernetes-client that referenced this issue Mar 13 – Mar 28, 2023
manusa pushed commits to shawkins/kubernetes-client that referenced this issue Mar 31 and Apr 3, 2023
@manusa manusa added this to the 6.6.0 milestone Apr 4, 2023