-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
hsi get retry #218
hsi get retry #218
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chengzhuzhang @golaz Luckily this issue doesn't require too many code changes, but I want to confirm a couple design choices -- see comments. Thanks!
zstash/hpss.py
Outdated
@@ -95,7 +95,14 @@ def hpss_transfer( | |||
# Transfer file using `hsi` | |||
command: str = 'hsi -q "cd {}; {} {}"'.format(hpss, transfer_command, name) | |||
error_str: str = "Transferring file {} HPSS: {}".format(transfer_word, name) | |||
run_command(command, error_str) | |||
if transfer_type == "get": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to retry any failing hsi
operation, or only hsi get
?
zstash/hpss.py
Outdated
run_command(command, error_str) | ||
except RuntimeError: | ||
# Retry if `hsi get` fails. | ||
run_command(command, error_str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to always retry hsi get
in the event of failure? Or do we indeed want a command line option to request this?
#212 specifically addresses the incomplete tar error, whereas #198 is to simply retry We can add an additional test for the tar file being on disk (for backwards-compatibility, run check if tars table exists). We could make a command line option for number of retry attempts (default=1). |
20dfd52
to
e8647e0
Compare
zstash/hpss.py
Outdated
tries = retries + 1 | ||
while tries > 0: | ||
tries -= 1 | ||
try: | ||
run_command(command, error_str) | ||
# Command was successful. No need to retry. | ||
break | ||
except RuntimeError as e: | ||
actual_size = os.path.getsize(name) | ||
if ( | ||
(tries > 0) | ||
and (transfer_type == "get") | ||
and cur | ||
and tars_table_exists(cur) | ||
): | ||
cur.execute(f"select size from tars where name is '{name}';") | ||
expected_size: int = cur.fetchall()[0][0] | ||
if expected_size > actual_size: | ||
# Rerun since tar was incomplete | ||
# Break out of the for-loop and run the try-except block again | ||
continue | ||
raise e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@golaz @chengzhuzhang Here is the primary piece of code for this pull request. It's very challenging to test behavior that only occurs when there is an error, but I believe I have this set up correctly. Please let me know if this looks good to you as well. Thanks! (I did add a unit test, but that doesn't create an error to test the except
block.)
docs/source/usage.rst
Outdated
* ``--retries`` to set the number of times to retry ``hsi get`` if a tar file is incomplete. | ||
The default is 1 retry (2 tries total). Note that the archive you're extracting from | ||
must have been created using ``zstash >= v1.1.0`` for this to work. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@golaz @chengzhuzhang Here is an explanation of the --retries
option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not need to be strictly the case and should be modified.
- The retry option (in case of hsi get failure) should be independent of the version of zstash.
- But forcing an hsi get by detecting an incomplete tar file requires zstash >= v1.1.10
zstash/hpss.py
Outdated
expected_size: int = cur.fetchall()[0][0] | ||
if expected_size > actual_size: | ||
# Rerun since tar was incomplete | ||
# Break out of the for-loop and run the try-except block again |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this comment is from an earlier iteration of work; this should be changed to simply "Run the try-except block again"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would change the logic here. If hsi get returns a failure, we need to retry. Don't need to worry about tar file size and version of zstash here.
Edit: a more sophisticated option would be to additionally check the tar file size regardless of hsi get status (when supported). This could catch additional issues (hsi get completing without error, but returning an incomplete file). I don't know if that ever happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@golaz re: check the tar file size regardless of hsi get status
-- Sure, I can copy the if expected_size > actual_size:
logic to the try
block so that it is checked on both success and failure.
You also mentioned the logic should be at https://github.com/E3SM-Project/zstash/blob/main/zstash/extract.py#L435 (if not os.path.exists(tfname):
) instead, but I'm not seeing what effect that has. If there's no tfname
, then we have nothing to get a size from -- so I don't see what logic we would apply at this level. As for the case where tfname
does exist, we have: tfname
-> file_path
in hpss_get
-> file_path
in hpss_transfer
-> path, name = os.path.split(file_path)
-> actual_size = os.path.getsize(name)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@forsyth2, regarding the check around
Line 435 in 7534528
if not os.path.exists(tfname): |
I would modify to do the following: retrieve the tar from from HPSS
- if it is not in the disk cache (what we do now)
- OR if the file is on disk cache, but its size does not match the entry in the database (assuming version of zstash archive has that information).
This would help recover from previous errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code needs to be modified in two places to achieve the desired goals:
- When deciding whether the tar file needs to be retrieved from tape, check its size (if zstash database has the information).
- Retry if hsi get returns a failure. Regardless of file size and zstash version.
docs/source/usage.rst
Outdated
* ``--retries`` to set the number of times to retry ``hsi get`` if a tar file is incomplete. | ||
The default is 1 retry (2 tries total). Note that the archive you're extracting from | ||
must have been created using ``zstash >= v1.1.0`` for this to work. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not need to be strictly the case and should be modified.
- The retry option (in case of hsi get failure) should be independent of the version of zstash.
- But forcing an hsi get by detecting an incomplete tar file requires zstash >= v1.1.10
zstash/hpss.py
Outdated
expected_size: int = cur.fetchall()[0][0] | ||
if expected_size > actual_size: | ||
# Rerun since tar was incomplete | ||
# Break out of the for-loop and run the try-except block again |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would change the logic here. If hsi get returns a failure, we need to retry. Don't need to worry about tar file size and version of zstash here.
Edit: a more sophisticated option would be to additionally check the tar file size regardless of hsi get status (when supported). This could catch additional issues (hsi get completing without error, but returning an incomplete file). I don't know if that ever happens.
50a1ce5
to
4f54d5b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@golaz This is ready for another review. Thanks!
zstash/extract.py
Outdated
if not os.path.exists(tfname): | ||
# Will need to retrieve from HPSS | ||
hpss_get(hpss, tfname, cache) | ||
elif cur and tars_table_exists(cur): | ||
logger.info( | ||
f"{tfname} exists. Checking expected size matches actual size." | ||
) | ||
actual_size = os.path.getsize(tfname) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a test by doing the following:
Changed this code block to be:
if not os.path.exists(tfname):
# Will need to retrieve from HPSS
hpss_get(hpss, tfname, cache)
raise RuntimeError # NEW -- on retry, tfname will exist and logic will go into elif block
elif cur and tars_table_exists(cur):
logger.info(
f"{tfname} exists. Checking expected size matches actual size."
)
actual_size = os.path.getsize("/global/homes/f/forsyth/zstash_test/add_files.sh") # CHANGED -- this is a much smaller file, so the sizes won't match
and ran a modified version of the tutorial code:
$ emacs setup.sh
mkdir zstash_demo
mkdir zstash_demo/empty_dir
mkdir zstash_demo/dir
echo 'file0 stuff' > zstash_demo/file0.txt
echo '' > zstash_demo/file_empty.txt
echo 'file1 stuff' > zstash_demo/dir/file1.txt
$ emacs add_files.sh
mkdir zstash_demo/dir2
echo 'file2 stuff' > zstash_demo/dir2/file2.txt
echo 'file1 stuff with changes' > zstash_demo/dir/file1.txt
$ chmod 700 setup.sh
$ chmod 700 add_files.sh
$ ./setup.sh
$ zstash create --hpss=zstash_retries zstash_demo
$ ./add_files.sh
$ cd zstash_demo
$ zstash update --hpss=zstash_retries
$ cd ..
$ mkdir zstash_extraction2 && cd zstash_extraction2
$ zstash extract --hpss=zstash_retries --retries=3
zstash/extract.py
Outdated
f"select size from tars where name is '{name_only}';" | ||
) | ||
expected_size: int = cur.fetchall()[0][0] | ||
if expected_size > actual_size: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
!=
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- 2x2 check -- file exists & sizes match up.
- T & N/A: skip
hpss_get
- T & T: skip
hpss_get
- T & F: run
- F & N/A: run
- F & T:
runN/A -- can't compare sizes. - F & F: N/A -- can't compare sizes.
- T & N/A: skip
- Redo size check after
hpss_get
Perhaps have a flag variable should_retrieve
and then at the end call hpss_get
.
Nest logic:
if file doesn't exist:
do_retrieve = True # F -- all 3 cases
elif
if we can check the size:
if sizes match:
do_retrieve = False # T & T
else:
do_retrieve = True # T & F
else:
do_retrieve = False # T & N/A
if do_retrieve:
hpss_get
if we can check the size AND the sizes don't match:
raise exception ourselves
zstash/extract.py
Outdated
logger.info( | ||
f"{name_only}: expected size={expected_size} > {actual_size}=actual_size" | ||
) | ||
hpss_get(hpss, tfname, cache) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there's no tars_table
, if the previous hpss_get
had an exception, it won't retry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@forsyth2: this looks good to me now. I like your function to verify the size of the tar files. It makes for for a simpler and cleaner implementation of the various checks.
Thanks @golaz! |
hsi get
retry -- resolves #198