Handle ENOSPC more cleanly #3513

Merged: 6 commits, Aug 7, 2023

dbutenhof
Member

PBENCH-1239

In Pbench ops review, after staging the latest main with the intent of testing the new Tarball.extract on a large dataset, we pushed a >7 GB uperf tarball. The upload failed with an internal error and left a partial tarball copy in the ARCHIVE controller directory, revealing several related problems:

  1. The cache manager was using shutil.copy2, which copies the tarball from the staging area into the archive tree. Because nginx also caches the entire data stream, this triples the storage required to upload a tarball.
  2. On copy failure, the cache manager did not delete the partial file.
  3. Although the code that saves the initial data stream handled ENOSPC specially, trouble mapping the status in Werkzeug meant it was still reported as a "server internal error", which is not ideal.
  4. The MD5 file write was not similarly protected. This is a small file and ENOSPC is unlikely, but we should be prepared to handle it gracefully.

This PR changes the cache manager to use shutil.move (which was the original intent) to avoid a third copy of the tarball. On failure, we unlink the partial file. Both the initial tarball write and the MD5 write now handle ENOSPC and return HTTP status 413 (request entity too large), which is not a perfect mapping but is a standard error code that Werkzeug can handle.
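
The move-and-clean-up behavior described above can be sketched roughly as follows. This is a minimal illustration and not the Pbench implementation: `StorageFull`, `move_into_archive`, and the `http_status` attribute are hypothetical names. Only the use of `shutil.move`, the unlink on failure, and the ENOSPC-to-413 mapping come from the description.

```python
import errno
import shutil
from pathlib import Path

HTTP_REQUEST_ENTITY_TOO_LARGE = 413


class StorageFull(Exception):
    """Hypothetical error that carries the HTTP status to report."""

    http_status = HTTP_REQUEST_ENTITY_TOO_LARGE


def move_into_archive(source: Path, destination: Path) -> Path:
    """Move source to destination (a full file path), cleaning up on failure."""
    if destination.exists():
        # Refuse early so the cleanup below can never delete a prior file.
        raise FileExistsError(errno.EEXIST, "already archived", str(destination))
    try:
        # On the same filesystem this is a rename; otherwise shutil.move
        # falls back to copy-then-delete, which can fail part way through.
        shutil.move(str(source), str(destination))
    except OSError as exc:
        # Remove any partial copy so a retry starts from a clean state.
        destination.unlink(missing_ok=True)
        if exc.errno == errno.ENOSPC:
            raise StorageFull(f"Out of space archiving {source.name}") from exc
        raise
    return destination
```

A caller in the API layer could catch `StorageFull` and respond with its `http_status`. How Pbench actually propagates the status to Werkzeug is not shown here.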

@dbutenhof dbutenhof added the Server and API labels Aug 4, 2023
@dbutenhof dbutenhof requested review from ndokos and webbnh August 4, 2023 15:42
@dbutenhof dbutenhof self-assigned this Aug 4, 2023
webbnh previously approved these changes Aug 4, 2023
Member

@webbnh webbnh left a comment

Looks generally good. However, there are some small items which you might want to address.

lib/pbench/server/api/resources/intake_base.py (outdated, resolved)
lib/pbench/server/cache_manager.py (outdated, resolved)
lib/pbench/server/cache_manager.py (outdated, resolved)
controller = cm.archive_root / "ABC"
controller.mkdir()
(controller / source_tarball.name).write_text("Send in the clones")

# Attempting to create a dataset from the md5 file should result in
# a duplicate dataset error
with pytest.raises(DuplicateTarball) as exc:
    cm.create(source_tarball)
assert exc.value.tarball == Dataset.stem(source_tarball)
Member

The coverage report indicates that we do not test the case of a pre-existing MD5 file (without a pre-existing tarball).

Member Author

Not a new case, nor is the condition very realistic. I had kinda hoped to get this done today and I'm trying to finish something else.

Member

All true, but, realistic or not, we have code that checks for it, and, therefore, it arguably should be exercised by the tests. Whether it should be addressed in this PR or not is a totally separate question. (I'm just reporting what I found...and I did approve the PR....)
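
For illustration, a test of the uncovered case (an MD5 file present without its tarball) might look something like the sketch below. Everything here is a hypothetical stand-in: `check_for_duplicate` is not the cache manager's real code, this `DuplicateTarball` only mimics the `tarball` attribute used in the test above, and the `.md5` naming is an assumption.

```python
from pathlib import Path


class DuplicateTarball(Exception):
    """Stand-in for the exception named in the test above."""

    def __init__(self, tarball: str):
        super().__init__(f"A dataset named {tarball!r} is already present")
        self.tarball = tarball


def check_for_duplicate(controller: Path, tarball_name: str) -> None:
    """Hypothetical guard: reject if the tarball or its MD5 file exists."""
    stem = tarball_name.removesuffix(".tar.xz")
    for name in (tarball_name, f"{tarball_name}.md5"):
        if (controller / name).exists():
            raise DuplicateTarball(stem)


def check_md5_only_is_duplicate(root: Path) -> None:
    """Exercise the case where only the MD5 file pre-exists."""
    controller = root / "ABC"
    controller.mkdir()
    (controller / "example.tar.xz.md5").write_text("not a real checksum")
    try:
        check_for_duplicate(controller, "example.tar.xz")
    except DuplicateTarball as exc:
        assert exc.tarball == "example"
    else:
        raise AssertionError("expected DuplicateTarball")
```

In the real test suite this would presumably use `pytest.raises` and the existing fixtures, as the neighboring test does.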

lib/pbench/test/unit/server/test_cache_manager.py (outdated, resolved)
lib/pbench/test/unit/server/test_cache_manager.py (outdated, resolved)
lib/pbench/test/unit/server/test_cache_manager.py (outdated, resolved)
lib/pbench/test/unit/server/test_upload.py (outdated, resolved)
webbnh previously approved these changes Aug 4, 2023
Member

@webbnh webbnh left a comment

Looks great.

I just have the one question lingering from our previous exchange.

lib/pbench/test/unit/server/test_cache_manager.py (outdated, resolved)
webbnh previously approved these changes Aug 7, 2023
Member

@webbnh webbnh left a comment

OK, I approve...unless there's another change coming. 😉

webbnh previously approved these changes Aug 7, 2023
Member

@webbnh webbnh left a comment

Nothing worth blocking the merge over, but I found some new items for you to consider.

lib/pbench/server/cache_manager.py (outdated, resolved)
lib/pbench/test/unit/server/test_cache_manager.py (outdated, resolved)
Member

@webbnh webbnh left a comment

I am so sorry, Dave, but I think I found a bug in one of the test assertions. 😞

lib/pbench/test/unit/server/test_cache_manager.py (outdated, resolved)
@dbutenhof dbutenhof requested a review from webbnh August 7, 2023 19:14
Member

@webbnh webbnh left a comment

Nice improvements!

@dbutenhof dbutenhof merged commit b4ef1dd into distributed-system-analysis:main Aug 7, 2023
@dbutenhof dbutenhof deleted the outaspace branch August 7, 2023 21:23