Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

t: add storage-constrained tests for tiered storage systems #673

Merged
merged 7 commits into from
Jul 14, 2020

Conversation

SteVwonder
Copy link
Member

Problem: previous multi-tier storage tests allocated compute and storage
in the exact same ratio as found in the cluster. It is conceivable that
these tests would pass under a scheduler that only respected compute
constraints while ignoring IO constraints.

Solution: add tests with jobspecs that request 2x storage of the
existing tests to exhaust the storage resources before the compute ones.

Closes #623

@SteVwonder SteVwonder requested a review from dongahn June 17, 2020 02:19
@codecov-commenter
Copy link

codecov-commenter commented Jun 17, 2020

Codecov Report

Merging #673 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #673      +/-   ##
==========================================
- Coverage   75.22%   75.21%   -0.02%     
==========================================
  Files          78       78              
  Lines        8300     8300              
==========================================
- Hits         6244     6243       -1     
- Misses       2056     2057       +1     
Impacted Files Coverage Δ
qmanager/modules/qmanager_callbacks.cpp 76.42% <0.00%> (-0.72%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a44efb6...fd805d0. Read the comment docs.

@SteVwonder
Copy link
Member Author

I'm not sure why codecov is failing since this PR only adds tests without adding code, but besides that, I think this PR is ready for review.

@SteVwonder
Copy link
Member Author

PS - this is a relatively low priority PR, and I know we are up against a deadline. I'm more than happy to have this PR delayed until after the 0.9.0 tag if others agree that other outstanding PRs are more important.

@dongahn
Copy link
Member

dongahn commented Jul 10, 2020

@SteVwonder: I will review this right after lunch.

@dongahn dongahn requested a review from milroy July 10, 2020 19:00
Copy link
Member

@dongahn dongahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great coverage increase! Here is the first batch (only on the node-local coverage). I didn't want to possibly lose this during my lunch. I had lots of IT issues -- I wonder if this is a good time to replace my laptop... Will circle back to the rest of the multi-tiered storage levels.

t/data/resource/expected/.gitignore Outdated Show resolved Hide resolved
t/t3018-resource-mtl0.t Outdated Show resolved Hide resolved
t/t3018-resource-mtl0.t Outdated Show resolved Hide resolved
Copy link
Member

@dongahn dongahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SteVwonder: This PR looks good generally. I just spotted a few minor issues but I will approve the PR. Once you address those minor issues, feel free to add the MWP label. Thanks!

@milroy
Copy link
Member

milroy commented Jul 10, 2020

I can review this PR in the next three hours.

@milroy
Copy link
Member

milroy commented Jul 11, 2020

I can review this PR in the next three hours.

It seems this estimate was premature. I needed to address changes with my PR and haven't gotten around to review this. I'll get to it ASAP!

@SteVwonder SteVwonder force-pushed the unbalanced-tiered-storage branch 3 times, most recently from 31be293 to 9c64eaa Compare July 13, 2020 22:09
@SteVwonder
Copy link
Member Author

@dongahn: Thanks for the feedback! I just rebased and force-pushed with the changes. I forgot and autosquashed before pushing so that the fixup commits were auto-rolled in. Sorry about that. I can probably undo the autosquash with reflog if you want.

@dongahn
Copy link
Member

dongahn commented Jul 13, 2020

I can probably undo the autosquash with reflog if you want.

Not necessary! I will take a brief look at it once again and put the MWP label if every looks okay.

@dongahn
Copy link
Member

dongahn commented Jul 13, 2020

autosquashed

BTW, how do you do this? :-)

This LGTM. MWP.

@dongahn
Copy link
Member

dongahn commented Jul 13, 2020

@milroy: Ah... I forgot you were also reviewing this PR. Let me know if you have any comments that you want to see resolved before the MWP.

@SteVwonder
Copy link
Member Author

BTW, how do you do this? :-)

When you commit a fixup or squash commit, use the --fixup or --squash flags (fixup when no changes to the original commit message are required, and squash when changes are required) and point at a specific commit to squash into. Then it's just a matter of running git rebase --interactive --autosquash. Really good blog post on the feature that I'll avoid duplicating here: https://thoughtbot.com/blog/autosquashing-git-commits

Copy link
Member

@milroy milroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I only found a couple of typos in a commit message.

@SteVwonder SteVwonder force-pushed the unbalanced-tiered-storage branch from 9c64eaa to 8114dca Compare July 14, 2020 17:23
@SteVwonder
Copy link
Member Author

Thanks @dongahn and @milroy for the feedback! Just addressed @milroy's comment and pushed. I'm going to set the MWP label.

@SteVwonder SteVwonder added the merge-when-passing mergify.io - merge PR automatically once CI passes label Jul 14, 2020
@dongahn
Copy link
Member

dongahn commented Jul 14, 2020

@SteVwonder: sounds great to me. Great addition to our multi-tiered storage support!

.out is the suffix of the files contained in this data directory
Problem: previous multi-tier storage tests allocated compute and storage
in the exact same ratio as found in the cluster.  It is conceivable that
these tests would pass under a scheduler that only respected compute
constraints while ignoring IO constraints.

Solution: add tests with jobspecs that request 2x storage of the
existing tests to exhaust the storage resources before the compute ones

Note: the node-local storage case is known in the tests as L0
Add storage-constrained tests -- tests with jobspecs that request 2x
storage of the existing tests to exhaust the storage resources before
the compute ones -- for the rack-local storage case, known in the tests
as L1.
These tests explicity request that the rack-local SSDs and the compute
nodes allocated to the job share the same rack.  Previous tests allowed
the SSDs and nodes to be on different racks.
Add storage-constrained tests -- tests with jobspecs that request 2x
storage of the existing tests to exhaust the storage resources before
the compute ones -- for the global storage case, known in the tests
as L2.
Add storage-constrained tests -- tests with jobspecs that request 2x
storage of the existing tests to exhaust the storage resources before
the compute ones -- for the hybrid storage case, known in the tests
as L3.
Problem: when running `make distcheck`, the creation of a tarball fails
due to filenames that are too long.  This is because autotools defaults
to the oldest, most compatible tar format (v7) which limits filename
length to 99 characters.

Solution: switch to a newer tar format (which is honestly still quite
old - defined in the 1988 POSIX Standard) called `ustar` that supports
filenames up to 256 characters in length. If this becomes a problem in
the future, we can switch to `pax` which has no limits.  Additionally,
set the filename-length-max option so that the entire build fails rather
than the tar silently failing to be created.

Further Reading:
- https://www.gnu.org/software/tar/manual/html_section/tar_67.html
- https://www.gnu.org/software/automake/manual/html_node/List-of-Automake-options.html
- https://noiselabs.io/tar-file-name-is-too-long-max-99/
@SteVwonder SteVwonder force-pushed the unbalanced-tiered-storage branch from 8114dca to fd805d0 Compare July 14, 2020 17:38
@mergify mergify bot merged commit 9d60c75 into flux-framework:master Jul 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge-when-passing mergify.io - merge PR automatically once CI passes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add test for multi-tier storage where IO is the constrained resource
4 participants