
fix #4910: making upload more reliable #4968

Merged: 1 commit merged into fabric8io:master from iss4910 on Apr 4, 2023

Conversation

shawkins (Contributor) commented Mar 13, 2023

Description

Fix #4910

Trying to fully address the upload failures.

  • They seem to happen more with the single file upload than with the directory upload, but that could simply be because we have more tests doing single file uploads. Compared to kubectl, which does everything through tar, I tried to do the same here (see the sketch after this list). It will limit files to 8GB. However, the commons-compress tar does not work well with streams of unknown size; there is an ugly reflection-based workaround, which may be too hackish.
  • I couldn't get completion detection working very well with just a single exec if we use gzip. You have to send a very large amount of additional data to force the tar -z to give up on stdin without an explicit close. Here again, like kubectl, I've just dropped gzip - https://github.com/kubernetes/kubectl/blob/master/pkg/cmd/cp/cp.go
  • Assuming the missing data may still happen with tar, at the end of the tar we write a hidden file that signals we're properly done. A background process will watch for that file and echo to stdout when it appears, and we'll pick that up. The other approach is to copy the file in one exec, then perform a size / termination check, expansion, and removal of the temp file as a second exec. We'd be required to use that approach to retain compression.
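
For the first point, a minimal sketch of wrapping a single file in a tar stream with commons-compress; the class and method names are illustrative, not the actual PR code. It shows why the entry size has to be known up front, which is what makes streams of unknown length awkward:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
import org.apache.commons.compress.utils.IOUtils;

class TarUploadSketch {

  // Wraps a single file as a tar stream. The entry size has to be known before any
  // bytes are written, because the tar header precedes the data; the default header
  // format also caps a single entry at 8GB.
  static void writeSingleFileTar(String nameInTar, InputStream data, long size, OutputStream out)
      throws IOException {
    try (TarArchiveOutputStream tar = new TarArchiveOutputStream(out)) {
      TarArchiveEntry entry = new TarArchiveEntry(nameInTar);
      entry.setSize(size); // streams of unknown length have no good answer here
      tar.putArchiveEntry(entry);
      IOUtils.copy(data, tar);
      tar.closeArchiveEntry();
      tar.finish(); // writes the trailing zero blocks so the remote tar sees a clean end
    }
  }
}
```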

@manusa and others - any thoughts on the first and last points?

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Chore (non-breaking change which doesn't affect codebase;
    test, version modification, documentation, etc.)

Checklist

  • Code contributed by me aligns with current project license: Apache 2.0
  • I added a CHANGELOG entry regarding this change
  • I have implemented unit tests to cover my changes
  • I have added/updated the javadocs and other documentation accordingly
  • No new bugs, code smells, etc. in SonarCloud report
  • I tested my code in Kubernetes
  • I tested my code in OpenShift

shawkins (Contributor, Author) commented:

We can work around 1 by shading the class ourselves if/until an upstream fix is supplied. The only other problem is that commons-compress is currently an optional dependency; it would then become required for all file uploads.

I'll decouple the EOF-detection fix until after we hit more of the upload issues - if for some reason they are more problematic with piping cat to a file, then we may not need to do anything else.

@shawkins shawkins force-pushed the iss4910 branch 3 times, most recently from ec2dc05 to c1b6494 Compare March 23, 2023 14:04
shawkins (Contributor, Author) commented:

> We can work around 1 by shading the class ourselves if/until an upstream fix is supplied. The only other problem is that commons-compress is currently an optional dependency; it would then become required for all file uploads.

This does not work - the tar still writes the unlimited size to the header in such a way that limiting it later results in files that are padded up to the nearest record block. So we will need two different upload strategies - one for streams and one for tars.

> I'll decouple the EOF-detection fix until after we hit more of the upload issues - if for some reason they are more problematic with piping cat to a file, then we may not need to do anything else.

These changes now include launching a nanny process to check for the expected tar size before closing. I've had no luck reproducing the error locally to confirm this addresses the failing case - at worst, it is the closing of the websocket / stdin that triggers the final writing of bytes. Unfortunately, we won't know that unless / until this prospective change starts failing in GitHub runs.
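
As a rough, hypothetical illustration of the nanny idea - the command below is an assumption for the sake of discussion, not the PR's actual remote command:

```java
class UploadNannySketch {

  // Hypothetical illustration of the "nanny" check described above: poll the uploaded
  // file on the pod until it reaches the expected byte count, then echo a marker so the
  // client knows the last bytes were flushed before it closes stdin / the websocket.
  static String nannyCommand(String remotePath, long expectedSize) {
    return "while [ \"$(wc -c < " + remotePath + ")\" -lt " + expectedSize + " ]; "
        + "do sleep 1; done; echo upload-complete";
  }
}
```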

The other downside of the changes so far is that all uploads, not just directory uploads, will require commons-compress. Everything now uses CountingOutputStream for verification.
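
A minimal sketch of that counting-based verification, assuming commons-compress's CountingOutputStream and illustrative method names rather than the PR's actual wiring:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.commons.compress.utils.CountingOutputStream;

class UploadVerificationSketch {

  // Wrap the stream feeding the pod's stdin so we know exactly how many bytes were
  // handed off; after the exec completes, this count can be compared with the size
  // observed on the server side.
  static long copyAndCount(InputStream local, OutputStream remoteStdin) throws IOException {
    CountingOutputStream counting = new CountingOutputStream(remoteStdin);
    local.transferTo(counting);
    counting.flush();
    return counting.getBytesWritten();
  }
}
```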

@shawkins shawkins marked this pull request as ready for review March 23, 2023 14:04
shawkins (Contributor, Author) commented:

@manusa @rohanKanojia if no one objects to the approach here, I'll go ahead and fix up the unit tests to work with the new upload verification logic.

@shawkins shawkins force-pushed the iss4910 branch 2 times, most recently from cbd8626 to e7da77b Compare March 23, 2023 17:57
shawkins (Contributor, Author) commented Mar 23, 2023

Based upon the failures in https://github.com/fabric8io/kubernetes-client/actions/runs/4503720351/jobs/7927685644?pr=4968 and the lack of responses back from the server, we can infer that while we had written all of the bytes, the final bytes had not been written on the server side. So the flushing of stdin / processing by cat is not predictable and we cannot simply wait on it. We would either need to send more data through and then truncate, or simply keep retrying. It seems to make more sense to do the latter - but that won't work for streams unless we create a copy.

The easiest thing to do at this point would be to just let the upload finish, then verify the size, and if it's incorrect, report it as a failed upload. We'll let the user decide for now how to work around it. In our integration tests we can add some simple retry logic to deflake the tests.
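
A hedged sketch of that verify-then-report idea; the helper name and the use of a wc -c exec are assumptions for illustration, not the PR's exact code:

```java
class SizeCheckSketch {

  // Hypothetical post-upload verification: compare the byte count written locally with
  // the size the pod reports, e.g. the output of an exec of "wc -c < /path/on/pod".
  // If they differ, the upload is reported as failed and the caller (or an integration
  // test) decides whether to retry.
  static boolean uploadSucceeded(long bytesWrittenLocally, String wcOutput) {
    long remoteSize = Long.parseLong(wcOutput.trim().split("\\s+")[0]);
    return remoteSize == bytesWrittenLocally;
  }
}
```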

@shawkins shawkins force-pushed the iss4910 branch 4 times, most recently from 8211910 to faee793 Compare March 24, 2023 14:19
@shawkins shawkins changed the title from "fix #4910: speculative changes for upload" to "fix #4910: making upload more reliable" Mar 24, 2023
@shawkins shawkins force-pushed the iss4910 branch 6 times, most recently from 40c49fe to 57d8d70 Compare March 28, 2023 18:27
sonarqubecloud bot commented Apr 3, 2023

SonarCloud Quality Gate failed.

  • Bugs: 0 (rated A)
  • Vulnerabilities: 0 (rated A)
  • Security Hotspots: 0 (rated A)
  • Code Smells: 1 (rated A)
  • Coverage: 63.3%
  • Duplication: 0.0%

Comment on lines +303 to +305
void retryUpload(BooleanSupplier operation) {
Awaitility.await().atMost(60, TimeUnit.SECONDS).until(operation::getAsBoolean);
}
Member:

I'm not really convinced about this, might eventually hide something else.

shawkins (Contributor, Author) replied:

Certainly - since we are only dealing with the boolean return, this does not qualify the exact nature of the failed upload. However, most other problems are likely to be deterministic, in which case we'll exceed the timeout. Only other somewhat rare non-deterministic problems would be missed.
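
For context, an illustrative call site for the retryUpload helper above; the pod name, namespace, and paths are made up, and it assumes the client's upload call returns a boolean success flag:

```java
import java.nio.file.Paths;
import java.util.function.BooleanSupplier;

import io.fabric8.kubernetes.client.KubernetesClient;

class RetryUploadUsageSketch {

  // Illustrative only: the supplier is retried by Awaitility until it returns true
  // or the 60-second limit is exceeded.
  void example(KubernetesClient client) {
    retryUpload(() -> client.pods().inNamespace("default").withName("pod-standard")
        .file("/tmp/toBeCopied").upload(Paths.get("/tmp/local-file")));
  }

  void retryUpload(BooleanSupplier operation) {
    // see the helper in the diff hunk above
  }
}
```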

@manusa manusa added this to the 6.6.0 milestone Apr 4, 2023
@manusa manusa merged commit d15341c into fabric8io:master Apr 4, 2023
@shawkins shawkins mentioned this pull request Apr 11, 2023
Successfully merging this pull request may close these issues.

Issues with upload / failures with PodIT