hsi get retry #218

forsyth2 · 2022-08-29T23:32:14Z

hsi get retry -- resolves #198

forsyth2

@chengzhuzhang @golaz Luckily this issue doesn't require too many code changes, but I want to confirm a couple design choices -- see comments. Thanks!

forsyth2 · 2022-08-29T23:33:02Z

zstash/hpss.py

@@ -95,7 +95,14 @@ def hpss_transfer(
            # Transfer file using `hsi`
            command: str = 'hsi -q "cd {}; {} {}"'.format(hpss, transfer_command, name)
            error_str: str = "Transferring file {} HPSS: {}".format(transfer_word, name)
-            run_command(command, error_str)
+            if transfer_type == "get":


Do we want to retry any failing hsi operation, or only hsi get?

forsyth2 · 2022-08-29T23:34:04Z

zstash/hpss.py

+                    run_command(command, error_str)
+                except RuntimeError:
+                    # Retry if `hsi get` fails.
+                    run_command(command, error_str)


Does it make sense to always retry hsi get in the event of failure? Or do we indeed want a command line option to request this?

forsyth2 · 2022-08-31T22:36:24Z

#212 specifically addresses the incomplete tar error, whereas #198 is to simply retry hsi get. #217 occurred because of #212. This pull request should resolve all three issues.

We can add an additional test for the tar file being on disk (for backwards compatibility, run the check only if the tars table exists). We could also make a command line option for the number of retry attempts (default=1).

forsyth2 · 2022-09-09T21:22:58Z

zstash/hpss.py

+            tries = retries + 1
+            while tries > 0:
+                tries -= 1
+                try:
+                    run_command(command, error_str)
+                    # Command was successful. No need to retry.
+                    break
+                except RuntimeError as e:
+                    actual_size = os.path.getsize(name)
+                    if (
+                        (tries > 0)
+                        and (transfer_type == "get")
+                        and cur
+                        and tars_table_exists(cur)
+                    ):
+                        cur.execute(f"select size from tars where name is '{name}';")
+                        expected_size: int = cur.fetchall()[0][0]
+                        if expected_size > actual_size:
+                            # Rerun since tar was incomplete
+                            # Break out of the for-loop and run the try-except block again
+                            continue
+                    raise e


@golaz @chengzhuzhang Here is the primary piece of code for this pull request. It's very challenging to test behavior that only occurs when there is an error, but I believe I have this set up correctly. Please let me know if this looks good to you as well. Thanks! (I did add a unit test, but that doesn't create an error to test the except block.)
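One way the except block could be exercised in a unit test is to pass the command runner in as a callable and substitute a stand-in that fails a fixed number of times. The sketch below is only an illustration of that idea: `run_with_retries` and `make_flaky` are hypothetical names, not functions in zstash, and the loop mirrors the `tries = retries + 1` counting from the diff above without the size check.

```python
from typing import Callable, List


def run_with_retries(run: Callable[[], None], retries: int) -> int:
    """Call `run` until it succeeds or the retry budget is spent.

    Returns the number of attempts made. Hypothetical helper, not zstash's API.
    """
    attempts = 0
    tries = retries + 1  # one initial attempt plus `retries` retries
    while tries > 0:
        tries -= 1
        attempts += 1
        try:
            run()
            return attempts  # command succeeded, no need to retry
        except RuntimeError:
            if tries == 0:
                raise  # budget exhausted: surface the last error
    return attempts  # only reached when retries < 0, i.e. no attempt was made


def make_flaky(failures: int, calls: List[int]) -> Callable[[], None]:
    """Build a stand-in for `run_command` that fails `failures` times, then succeeds."""

    def flaky() -> None:
        calls.append(1)  # record every invocation so a test can count them
        if len(calls) <= failures:
            raise RuntimeError("simulated hsi get failure")

    return flaky
```

A test can then assert on the number of recorded calls, for example that one simulated failure with `retries=1` results in exactly two invocations.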

forsyth2 · 2022-09-09T21:23:38Z

docs/source/usage.rst

+* ``--retries`` to set the number of times to retry ``hsi get`` if a tar file is incomplete.
+  The default is 1 retry (2 tries total). Note that the archive you're extracting from
+  must have been created using ``zstash >= v1.1.0`` for this to work.


@golaz @chengzhuzhang Here is an explanation of the --retries option.
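For readers unfamiliar with how such an option is typically declared, here is a minimal `argparse` sketch. It assumes an integer option with a default of 1, as the documentation text above describes; `build_parser` and the `prog` name are hypothetical and this is not the actual zstash parser.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser for illustration only; not the zstash implementation.
    parser = argparse.ArgumentParser(prog="extract-sketch")
    parser.add_argument("--hpss", type=str, help="path to the HPSS archive")
    parser.add_argument(
        "--retries",
        type=int,
        default=1,
        help="number of times to retry hsi get after a failure (default: 1)",
    )
    return parser


# Same flags as the `zstash extract` invocation used for testing in this thread.
args = build_parser().parse_args(["--hpss=zstash_retries", "--retries=3"])
total_tries = args.retries + 1  # the first attempt plus the retries
```

With `--retries=3` this gives four tries in total; omitting the flag gives the documented default of one retry (two tries).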

forsyth2 · 2022-09-09T21:25:43Z

zstash/hpss.py

+                        expected_size: int = cur.fetchall()[0][0]
+                        if expected_size > actual_size:
+                            # Rerun since tar was incomplete
+                            # Break out of the for-loop and run the try-except block again


Note: this comment is from an earlier iteration of work; this should be changed to simply "Run the try-except block again"

golaz · 2022-09-14 (edited)

I would change the logic here. If hsi get returns a failure, we need to retry. Don't need to worry about tar file size and version of zstash here.

Edit: a more sophisticated option would be to additionally check the tar file size regardless of hsi get status (when supported). This could catch additional issues (hsi get completing without error, but returning an incomplete file). I don't know if that ever happens.

forsyth2 · 2022-09-16

@golaz re: check the tar file size regardless of hsi get status -- Sure, I can copy the if expected_size > actual_size: logic to the try block so that it is checked on both success and failure.

You also mentioned the logic should be at https://github.com/E3SM-Project/zstash/blob/main/zstash/extract.py#L435 (if not os.path.exists(tfname):) instead, but I'm not seeing what effect that has. If there's no tfname, then we have nothing to get a size from -- so I don't see what logic we would apply at this level. As for the case where tfname does exist, we have: tfname -> file_path in hpss_get -> file_path in hpss_transfer -> path, name = os.path.split(file_path) -> actual_size = os.path.getsize(name).

golaz · 2022-09-19

@forsyth2, regarding the check around zstash/zstash/extract.py, line 435 in 7534528:

    if not os.path.exists(tfname):

I would modify it to retrieve the tar file from HPSS:

- if it is not in the disk cache (what we do now), OR
- if the file is in the disk cache, but its size does not match the entry in the database (assuming the zstash archive version has that information).

This would help recover from previous errors.
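The check described in this comment could be sketched roughly as follows. This is an illustration under assumptions, not the merged code: it assumes the archive database is SQLite with a `tars` table holding `name` and `size` columns (as the query in the diff suggests), the body of `tars_table_exists` is a guess at what such a helper might do, and `expected_tar_size` and `needs_retrieval` are hypothetical names. The lookup uses a bound parameter rather than an f-string.

```python
import os
import sqlite3
from typing import Optional


def tars_table_exists(cur: sqlite3.Cursor) -> bool:
    # Assumed implementation: older archives have no `tars` table at all.
    cur.execute("SELECT name FROM sqlite_master WHERE type='table' AND name='tars'")
    return cur.fetchone() is not None


def expected_tar_size(cur: sqlite3.Cursor, name: str) -> Optional[int]:
    """Return the recorded size of `name`, or None if it cannot be looked up."""
    if not tars_table_exists(cur):
        return None
    cur.execute("SELECT size FROM tars WHERE name = ?", (name,))
    row = cur.fetchone()
    return None if row is None else row[0]


def needs_retrieval(cur: sqlite3.Cursor, tfname: str) -> bool:
    """Decide whether the tar file must be fetched from HPSS."""
    if not os.path.exists(tfname):
        return True  # not in the disk cache (the existing behaviour)
    expected = expected_tar_size(cur, os.path.basename(tfname))
    if expected is None:
        return False  # size unknown: trust the cached copy
    return os.path.getsize(tfname) != expected  # stale or incomplete copy
```

Note that this sketch treats any size difference as a mismatch, whereas the diff under review compares with `expected_size > actual_size`; which comparison is preferable is a design choice for the PR.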

golaz

The code needs to be modified in two places to achieve the desired goals:

1. When deciding whether the tar file needs to be retrieved from tape, check its size (if the zstash database has the information).
2. Retry if hsi get returns a failure, regardless of file size and zstash version.

golaz · 2022-09-14T17:58:58Z

docs/source/usage.rst

+* ``--retries`` to set the number of times to retry ``hsi get`` if a tar file is incomplete.
+  The default is 1 retry (2 tries total). Note that the archive you're extracting from
+  must have been created using ``zstash >= v1.1.0`` for this to work.


This does not need to be strictly the case and should be modified.

The retry option (in case of hsi get failure) should be independent of the version of zstash.

But forcing an hsi get by detecting an incomplete tar file requires zstash >= v1.1.10

golaz · 2022-09-14T18:00:50Z

zstash/hpss.py

+                        expected_size: int = cur.fetchall()[0][0]
+                        if expected_size > actual_size:
+                            # Rerun since tar was incomplete
+                            # Break out of the for-loop and run the try-except block again


I would change the logic here. If hsi get returns a failure, we need to retry. Don't need to worry about tar file size and version of zstash here.

Edit: a more sophisticated option would be to additionally check the tar file size regardless of hsi get status (when supported). This could catch additional issues (hsi get completing without error, but returning an incomplete file). I don't know if that ever happens.
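A sketch of the "more sophisticated option" mentioned here, in which a transfer that reports success but leaves an incomplete file is counted as a failed attempt. The names `get_with_verification`, `fetch`, and `is_complete` are hypothetical; `fetch` stands in for the `hsi get` call and `is_complete` for whatever size check the archive supports.

```python
from typing import Callable, Optional


def get_with_verification(
    fetch: Callable[[], None],
    is_complete: Callable[[], bool],
    retries: int,
) -> None:
    """Fetch a file, retrying on failure or on an incomplete result.

    Hypothetical helper for illustration; not part of zstash.
    """
    last_error: Optional[Exception] = None
    for _ in range(retries + 1):  # one initial attempt plus `retries` retries
        try:
            fetch()
        except RuntimeError as e:
            last_error = e  # the transfer itself failed: try again
            continue
        if is_complete():
            return  # transfer succeeded and the file passed the check
        # The transfer reported success, but the file did not pass the check.
        last_error = RuntimeError("transfer succeeded but the file is incomplete")
    raise RuntimeError(f"giving up after {retries + 1} attempts") from last_error
```

For archives where the size cannot be checked, `is_complete` could simply return `True`, which reduces this to a plain retry on failure.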

forsyth2

@golaz This is ready for another review. Thanks!

forsyth2 · 2022-09-23T19:20:07Z

zstash/extract.py

+                    if not os.path.exists(tfname):
+                        # Will need to retrieve from HPSS
+                        hpss_get(hpss, tfname, cache)
+                    elif cur and tars_table_exists(cur):
+                        logger.info(
+                            f"{tfname} exists. Checking expected size matches actual size."
+                        )
+                        actual_size = os.path.getsize(tfname)


I did a test by doing the following:

Changed this code block to be:

    if not os.path.exists(tfname):
        # Will need to retrieve from HPSS
        hpss_get(hpss, tfname, cache)
        raise RuntimeError  # NEW -- on retry, tfname will exist and logic will go into elif block
    elif cur and tars_table_exists(cur):
        logger.info(
            f"{tfname} exists. Checking expected size matches actual size."
        )
        actual_size = os.path.getsize("/global/homes/f/forsyth/zstash_test/add_files.sh")  # CHANGED -- this is a much smaller file, so the sizes won't match

and ran a modified version of the tutorial code:

    $ emacs setup.sh
    mkdir zstash_demo
    mkdir zstash_demo/empty_dir
    mkdir zstash_demo/dir
    echo 'file0 stuff' > zstash_demo/file0.txt
    echo '' > zstash_demo/file_empty.txt
    echo 'file1 stuff' > zstash_demo/dir/file1.txt
    $ emacs add_files.sh
    mkdir zstash_demo/dir2
    echo 'file2 stuff' > zstash_demo/dir2/file2.txt
    echo 'file1 stuff with changes' > zstash_demo/dir/file1.txt
    $ chmod 700 setup.sh
    $ chmod 700 add_files.sh
    $ ./setup.sh
    $ zstash create --hpss=zstash_retries zstash_demo
    $ ./add_files.sh
    $ cd zstash_demo
    $ zstash update --hpss=zstash_retries
    $ cd ..
    $ mkdir zstash_extraction2 && cd zstash_extraction2
    $ zstash extract --hpss=zstash_retries --retries=3

forsyth2 · 2022-09-29T20:59:44Z

zstash/extract.py

+                            f"select size from tars where name is '{name_only}';"
+                        )
+                        expected_size: int = cur.fetchall()[0][0]
+                        if expected_size > actual_size:


2x2 check -- file exists & sizes match up.

T & N/A: skip hpss_get

T & T: skip hpss_get

T & F: run

F & N/A: run

F & T: ~~run~~ N/A -- can't compare sizes.

F & F: N/A -- can't compare sizes.

Redo size check after hpss_get

Perhaps have a flag variable should_retrieve and then at the end call hpss_get.

Nest logic:

    if file doesn't exist:
        do_retrieve = True  # F -- all 3 cases
    elif we can check the size:
        if sizes match:
            do_retrieve = False  # T & T
        else:
            do_retrieve = True  # T & F
    else:
        do_retrieve = False  # T & N/A

    if do_retrieve:
        hpss_get
        if we can check the size AND the sizes don't match:
            raise exception ourselves
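The decision table in this comment can be written as a small pure function, which makes each row easy to check. This is a sketch of the logic as described here, not the code that was merged; the function name borrows the `should_retrieve` flag name suggested above, and `sizes_match=None` is an assumed encoding of the "N/A" cases where no size comparison is possible.

```python
from typing import Optional


def should_retrieve(exists: bool, sizes_match: Optional[bool]) -> bool:
    """Return True when hpss_get should run.

    `sizes_match` is None when the sizes cannot be compared (N/A).
    """
    if not exists:
        return True  # F & anything: nothing on disk, so retrieve
    if sizes_match is None:
        return False  # T & N/A: cannot check, keep the cached file
    return not sizes_match  # T & T: skip, T & F: retrieve


# The rows of the table above that can actually occur.
truth_table = [
    ((True, None), False),   # T & N/A: skip hpss_get
    ((True, True), False),   # T & T: skip hpss_get
    ((True, False), True),   # T & F: run
    ((False, None), True),   # F & N/A: run
]
results = [should_retrieve(*inputs) == expected for inputs, expected in truth_table]
```

The follow-up step in the comment, re-checking the size after `hpss_get` and raising if it still does not match, would sit in the caller rather than in this function.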

forsyth2 · 2022-09-29T21:00:16Z

zstash/extract.py

+                            logger.info(
+                                f"{name_only}: expected size={expected_size} > {actual_size}=actual_size"
+                            )
+                            hpss_get(hpss, tfname, cache)


If there's no tars table and the previous hpss_get raised an exception, it won't retry.

forsyth2 · 2022-09-30T00:04:50Z

@golaz Updated retry logic in the latest commit -- 4b0a295.

If that looks good, can you approve the PR? Then, I can merge. Thanks!

golaz

@forsyth2: this looks good to me now. I like your function to verify the size of the tar files. It makes for a simpler and cleaner implementation of the various checks.

forsyth2 · 2022-10-13T00:05:00Z

Thanks @golaz!

forsyth2 added enhancement semver: new feature New feature (will increment minor version) priority: high High priority task labels Aug 29, 2022

forsyth2 self-assigned this Aug 29, 2022

forsyth2 force-pushed the hsi-retry branch 2 times, most recently from 20dfd52 to e8647e0 Compare September 9, 2022 21:21

forsyth2 requested review from chengzhuzhang and golaz September 9, 2022 21:24

golaz requested changes Sep 14, 2022

forsyth2 force-pushed the hsi-retry branch 4 times, most recently from 50a1ce5 to 4f54d5b Compare September 23, 2022 19:02

forsyth2 removed the Small improvement label Oct 12, 2022

golaz approved these changes Oct 12, 2022

hsi get retry

44f4317

forsyth2 force-pushed the hsi-retry branch from 4b0a295 to 44f4317 Compare October 13, 2022 00:03

forsyth2 merged commit f5b2495 into main Oct 13, 2022

forsyth2 deleted the hsi-retry branch October 13, 2022 00:05

This was referenced Oct 31, 2022

Handle incomplete tar files #212

Closed

"file could not be opened successfully" on a v2.LR.historical_0101 tar ball #217

Closed

forsyth2 mentioned this pull request Jun 1, 2023

Add retries option for check #268

Closed
