Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hsi get retry #218

Merged
merged 1 commit into from
Oct 13, 2022
Merged

hsi get retry #218

merged 1 commit into from
Oct 13, 2022

Conversation

forsyth2
Copy link
Collaborator

hsi get retry -- resolves #198

@forsyth2 forsyth2 added enhancement semver: new feature New feature (will increment minor version) priority: high High priority task labels Aug 29, 2022
@forsyth2 forsyth2 self-assigned this Aug 29, 2022
Copy link
Collaborator Author

@forsyth2 forsyth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chengzhuzhang @golaz Luckily this issue doesn't require too many code changes, but I want to confirm a couple design choices -- see comments. Thanks!

zstash/hpss.py Outdated
@@ -95,7 +95,14 @@ def hpss_transfer(
# Transfer file using `hsi`
command: str = 'hsi -q "cd {}; {} {}"'.format(hpss, transfer_command, name)
error_str: str = "Transferring file {} HPSS: {}".format(transfer_word, name)
run_command(command, error_str)
if transfer_type == "get":
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to retry any failing hsi operation, or only hsi get?

zstash/hpss.py Outdated
run_command(command, error_str)
except RuntimeError:
# Retry if `hsi get` fails.
run_command(command, error_str)
Copy link
Collaborator Author

@forsyth2 forsyth2 Aug 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to always retry hsi get in the event of failure? Or do we indeed want a command line option to request this?

@forsyth2
Copy link
Collaborator Author

#212 specifically addresses the incomplete tar error, whereas #198 is to simply retry hsi get. #217 occurred because of #212. This pull request should resolve all three issues.

We can add an additional test for the tar file being on disk (for backwards-compatibility, run check if tars table exists). We could make a command line option for number of retry attempts (default=1).

@forsyth2 forsyth2 force-pushed the hsi-retry branch 2 times, most recently from 20dfd52 to e8647e0 Compare September 9, 2022 21:21
zstash/hpss.py Outdated
Comment on lines 101 to 110
tries = retries + 1
while tries > 0:
tries -= 1
try:
run_command(command, error_str)
# Command was successful. No need to retry.
break
except RuntimeError as e:
actual_size = os.path.getsize(name)
if (
(tries > 0)
and (transfer_type == "get")
and cur
and tars_table_exists(cur)
):
cur.execute(f"select size from tars where name is '{name}';")
expected_size: int = cur.fetchall()[0][0]
if expected_size > actual_size:
# Rerun since tar was incomplete
# Break out of the for-loop and run the try-except block again
continue
raise e
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@golaz @chengzhuzhang Here is the primary piece of code for this pull request. It's very challenging to test behavior that only occurs when there is an error, but I believe I have this set up correctly. Please let me know if this looks good to you as well. Thanks! (I did add a unit test, but that doesn't create an error to test the except block.)

Comment on lines 307 to 309
* ``--retries`` to set the number of times to retry ``hsi get`` if a tar file is incomplete.
The default is 1 retry (2 tries total). Note that the archive you're extracting from
must have been created using ``zstash >= v1.1.0`` for this to work.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@golaz @chengzhuzhang Here is an explanation of the --retries option.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not need to be strictly the case and should be modified.

  • The retry option (in case of hsi get failure) should be independent of the version of zstash.
  • But forcing an hsi get by detecting an incomplete tar file requires zstash >= v1.1.10

zstash/hpss.py Outdated
expected_size: int = cur.fetchall()[0][0]
if expected_size > actual_size:
# Rerun since tar was incomplete
# Break out of the for-loop and run the try-except block again
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this comment is from an earlier iteration of work; this should be changed to simply "Run the try-except block again"

Copy link
Collaborator

@golaz golaz Sep 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would change the logic here. If hsi get returns a failure, we need to retry. Don't need to worry about tar file size and version of zstash here.

Edit: a more sophisticated option would be to additionally check the tar file size regardless of hsi get status (when supported). This could catch additional issues (hsi get completing without error, but returning an incomplete file). I don't know if that ever happens.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@golaz re: check the tar file size regardless of hsi get status -- Sure, I can copy the if expected_size > actual_size: logic to the try block so that it is checked on both success and failure.

You also mentioned the logic should be at https://github.com/E3SM-Project/zstash/blob/main/zstash/extract.py#L435 (if not os.path.exists(tfname):) instead, but I'm not seeing what effect that has. If there's no tfname, then we have nothing to get a size from -- so I don't see what logic we would apply at this level. As for the case where tfname does exist, we have: tfname -> file_path in hpss_get -> file_path in hpss_transfer -> path, name = os.path.split(file_path) -> actual_size = os.path.getsize(name).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@forsyth2, regarding the check around

if not os.path.exists(tfname):

I would modify to do the following: retrieve the tar from from HPSS

  • if it is not in the disk cache (what we do now)
  • OR if the file is on disk cache, but its size does not match the entry in the database (assuming version of zstash archive has that information).

This would help recover from previous errors.

Copy link
Collaborator

@golaz golaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code needs to be modified in two places to achieve the desired goals:

  • When deciding whether the tar file needs to be retrieved from tape, check its size (if zstash database has the information).
  • Retry if hsi get returns a failure. Regardless of file size and zstash version.

Comment on lines 307 to 309
* ``--retries`` to set the number of times to retry ``hsi get`` if a tar file is incomplete.
The default is 1 retry (2 tries total). Note that the archive you're extracting from
must have been created using ``zstash >= v1.1.0`` for this to work.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not need to be strictly the case and should be modified.

  • The retry option (in case of hsi get failure) should be independent of the version of zstash.
  • But forcing an hsi get by detecting an incomplete tar file requires zstash >= v1.1.10

zstash/hpss.py Outdated
expected_size: int = cur.fetchall()[0][0]
if expected_size > actual_size:
# Rerun since tar was incomplete
# Break out of the for-loop and run the try-except block again
Copy link
Collaborator

@golaz golaz Sep 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would change the logic here. If hsi get returns a failure, we need to retry. Don't need to worry about tar file size and version of zstash here.

Edit: a more sophisticated option would be to additionally check the tar file size regardless of hsi get status (when supported). This could catch additional issues (hsi get completing without error, but returning an incomplete file). I don't know if that ever happens.

@forsyth2 forsyth2 force-pushed the hsi-retry branch 4 times, most recently from 50a1ce5 to 4f54d5b Compare September 23, 2022 19:02
Copy link
Collaborator Author

@forsyth2 forsyth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@golaz This is ready for another review. Thanks!

zstash/extract.py Show resolved Hide resolved
Comment on lines 458 to 465
if not os.path.exists(tfname):
# Will need to retrieve from HPSS
hpss_get(hpss, tfname, cache)
elif cur and tars_table_exists(cur):
logger.info(
f"{tfname} exists. Checking expected size matches actual size."
)
actual_size = os.path.getsize(tfname)
Copy link
Collaborator Author

@forsyth2 forsyth2 Sep 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a test by doing the following:

Changed this code block to be:

                    if not os.path.exists(tfname):
                        # Will need to retrieve from HPSS
                        hpss_get(hpss, tfname, cache)
                        raise RuntimeError # NEW -- on retry, tfname will exist and logic will go into elif block
                    elif cur and tars_table_exists(cur):
                        logger.info(
                            f"{tfname} exists. Checking expected size matches actual size."
                        )
                        actual_size = os.path.getsize("/global/homes/f/forsyth/zstash_test/add_files.sh") # CHANGED -- this is a much smaller file, so the sizes won't match

and ran a modified version of the tutorial code:

$ emacs setup.sh
mkdir zstash_demo
mkdir zstash_demo/empty_dir
mkdir zstash_demo/dir
echo 'file0 stuff' > zstash_demo/file0.txt
echo '' > zstash_demo/file_empty.txt
echo 'file1 stuff' > zstash_demo/dir/file1.txt
$ emacs add_files.sh
mkdir zstash_demo/dir2
echo 'file2 stuff' > zstash_demo/dir2/file2.txt
echo 'file1 stuff with changes' > zstash_demo/dir/file1.txt
$ chmod 700 setup.sh
$ chmod 700 add_files.sh

$ ./setup.sh
$ zstash create --hpss=zstash_retries zstash_demo
$ ./add_files.sh
$ cd zstash_demo
$ zstash update --hpss=zstash_retries
$ cd ..
$ mkdir zstash_extraction2 && cd zstash_extraction2
$ zstash extract --hpss=zstash_retries --retries=3

f"select size from tars where name is '{name_only}';"
)
expected_size: int = cur.fetchall()[0][0]
if expected_size > actual_size:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!=

Copy link
Collaborator Author

@forsyth2 forsyth2 Sep 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • 2x2 check -- file exists & sizes match up.
    • T & N/A: skip hpss_get
    • T & T: skip hpss_get
    • T & F: run
    • F & N/A: run
    • F & T: run N/A -- can't compare sizes.
    • F & F: N/A -- can't compare sizes.
  • Redo size check after hpss_get

Perhaps have a flag variable should_retrieve and then at the end call hpss_get.

Nest logic:

if file doesn't exist:
  do_retrieve = True # F -- all 3 cases
elif 
  if we can check the size:
    if sizes match:
      do_retrieve = False # T & T
   else:
      do_retrieve = True # T & F
 else:
   do_retrieve = False # T & N/A
if do_retrieve:
  hpss_get
  if we can check the size AND the sizes don't match:
    raise exception ourselves

logger.info(
f"{name_only}: expected size={expected_size} > {actual_size}=actual_size"
)
hpss_get(hpss, tfname, cache)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's no tars_table, if the previous hpss_get had an exception, it won't retry.

@forsyth2
Copy link
Collaborator Author

@golaz Updated retry logic in the latest commit -- 4b0a295.

If that looks good, can you approve the PR? Then, I can merge. Thanks!

Copy link
Collaborator

@golaz golaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@forsyth2: this looks good to me now. I like your function to verify the size of the tar files. It makes for for a simpler and cleaner implementation of the various checks.

@forsyth2
Copy link
Collaborator Author

Thanks @golaz!

@forsyth2 forsyth2 merged commit f5b2495 into main Oct 13, 2022
@forsyth2 forsyth2 deleted the hsi-retry branch October 13, 2022 00:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: high High priority task semver: new feature New feature (will increment minor version)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Option to retry after hsi get failure
2 participants