Non block testing #355

Merged: 13 commits merged into main from non-block-testing, Jan 8, 2025
Conversation

@TonyB9000 (Collaborator) commented Dec 19, 2024

Issue resolution

Properly implements Blocking and Non-Blocking behavior in archive creation/transfer.

task_wait timeout and polling_interval should be increased to 7200 and 10, respectively.
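
For reference, a minimal sketch of that call (assuming a globus_sdk TransferClient instance named transfer_client and an existing task_id; the values here are the proposed ones, not necessarily what was merged):

# task_wait() polls Globus every polling_interval seconds and returns True
# if the task completes within timeout seconds, False otherwise.
done = transfer_client.task_wait(task_id, timeout=7200, polling_interval=10)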

Still leaves an open-ended question about what to do when we return from "waiting too long" (after max_retries: abort, Y/N?).

Could be stress-tested with MUCH larger files, where weeks are required for transfers to complete.

Many added "logger.info" messages could be eliminated, or made "logger.debug" messages. All added messages use the f-string syntax

message = f"var = {var}"

rather than

message = "var = {}".format(var)

Required:

  • [maybe] Product Management: I have confirmed with the stakeholders that the objectives above are correct and complete.
  • [no] Testing: I have added at least one automated test. Every objective above is represented in at least one test.
  • [no] Testing: I have considered likely and/or severe edge cases and have included them in testing.

If applicable:

  • [no] Testing: this pull request adds at least one new possible command line option. I have tested using this option with and without any other option that may interact with it.

2. Are the implementation details accurate & efficient?

Required:

  • [yes] Logic: I have visually inspected the entire pull request myself.
  • [in_progress] Logic: I have left GitHub comments highlighting important pieces of code logic. I have had these code blocks reviewed by at least one other team member.

If applicable:

  • Dependencies: This pull request introduces a new dependency. I have discussed this requirement with at least one other team member. The dependency is noted in zstash/conda, not just an import statement.

3. Is this well documented?

Required:

  • [maybe] Documentation: by looking at the docs, a new user could easily understand the functionality introduced by this pull request.

4. Is this code clean?

Required:

  • [maybe] Readability: The code is as simple as possible and well-commented, such that a new team member could understand what's happening.
  • [no] Pre-commit checks: All the pre-commit checks have passed.

If applicable:

  • Software architecture: I have discussed relevant trade-offs in design decisions with at least one other team member. It is unlikely that this pull request will increase tech debt.

@TonyB9000 requested review from golaz and forsyth2, December 19, 2024 19:02
@TonyB9000 (Collaborator, Author):

(Wot a hassle!)

@forsyth2 (Collaborator) left a comment

Thanks @TonyB9000! This was quite a logically complex piece of zstash to work on.

My analysis of the call hierarchy

There's a lot of code to follow here, but I think I've finally managed to understand all the major pieces. I've included thorough notes on that below. (Notes below are primarily for myself, but they might be helpful to others).

Verbose call hierarchy
create()
    logger.debug(f"{ts_utc()}: Running zstash create")
    logger.debug(f"{ts_utc()}:Calling globus_activate(hpss)")
    globus_activate(hpss)
        # No changes in this PR
    logger.debug(f"{ts_utc()}: Creating local cache directory")
    logger.debug(f"{ts_utc()}: Calling create_database()")
    failures: List[str] = create_database(cache, args)
        logger.debug(f"{ts_utc()}:Creating index database")
        failures = add_files(..., non_blocking=args.non_blocking)
            def add_files(..., non_blocking: bool = False):
            logger.info(f"{ts_utc()}: Creating new tar archive {tfname}")
            if i == nfiles - 1 or tarsize + next_file_size > maxsize:
                logger.debug(f"{ts_utc()}: Closing tar archive {tfname}")
                logger.info(f"{ts_utc()}: (add_files): Completed archive file {tfname}")
                logger.info(f"{ts_utc()}: DIVING: (add_files): Calling hpss_put to dispatch archive file {tfname}")
                hpss_put(hpss, os.path.join(cache, tfname), cache, keep, non_blocking)
                    # See hpss_put a few lines down
                logger.info(f"{ts_utc()}: SURFACE (add_files): Called hpss_put to dispatch archive file {tfname}")
    logger.debug(f"{ts_utc()}: calling hpss_put() for {get_db_filename(cache)}")
    hpss_put(hpss, get_db_filename(cache), cache, keep=True)
        def hpss_put(..., non_blocking: bool = False):
        hpss_transfer(hpss, file_path, "put", cache, keep, non_blocking)
            def hpss_transfer(..., non_blocking: bool = False):
            globus_status = "UNKNOWN"
            logger.info(f"{ts_utc()}: DIVING: hpss calls globus_transfer(name={name})")
            globus_status = globus_transfer(endpoint, url_path, name, transfer_type, non_blocking)
                # Changed in this PR!!!
                def globus_transfer(..., non_blocking: bool):
                # Bulk of the changes in the PR are in this logic
            if not keep and scheme == "globus" and globus_status == "SUCCEEDED" and not non_blocking:
                # We should not keep the local file, so delete it now that it is on HPSS
                os.remove(file_path)
            logger.info(f"{ts_utc()}: SURFACE hpss globus_transfer(name={name}) returns")
    logger.debug(f"{ts_utc()}: calling globus_finalize()")
    globus_finalize(non_blocking=args.non_blocking)
        # No changes in this PR
Simplified call hierarchy
create()
    globus_activate(hpss)
        # No changes in this PR
    failures: List[str] = create_database(cache, args)
        failures = add_files(..., non_blocking=args.non_blocking)
            if i == nfiles - 1 or tarsize + next_file_size > maxsize:
                hpss_put(hpss, os.path.join(cache, tfname), cache, keep, non_blocking)
                    hpss_transfer(hpss, file_path, "put", cache, keep, non_blocking)
                        globus_status = "UNKNOWN"
                        globus_status = globus_transfer(endpoint, url_path, name, transfer_type, non_blocking)
                            # Changed in this PR!!!
                            # Bulk of the changes in the PR are in this logic
                        if not keep and scheme == "globus" and globus_status == "SUCCEEDED" and not non_blocking:
                            # We should not keep the local file, so delete it now that it is on HPSS
                            os.remove(file_path)
    hpss_put(hpss, get_db_filename(cache), cache, keep=True)
        # See hpss_put line above
        # This block adds the db file
    globus_finalize(non_blocking=args.non_blocking)
        # No changes in this PR
Tony's pseudo-code
1   While Files_to_archive
2       Make_a_tar_file (stuff while tar size < MAXSIZE)
3       Transfer_a_tar_file
4           Add tar_file to TransferData items (create if needed)
5           if a Transfer is ACTIVE
6               return to Make another tar file
7           else
8               Start a transfer (dir of tar-files)
9           If BLOCKING
10              Try to wait for transfer completion
11          return Task_Status
12      Check Task_Status
Matching the pseudo-code to the real code
hpss_utils.py add_files
    # 1. While Files_to_archive
    for i in range(nfiles):
        if create_new_tar:
            create_new_tar = False
            # 2. Make_a_tar_file (stuff while tar size < MAXSIZE)
            tar = tarfile.open(mode="w", fileobj=tarFileObject, dereference=follow_symlinks) 
        hpss_utils.py add_file(tar, current_file, follow_symlinks)
        tarsize = tarFileObject.tell()
        if i == nfiles - 1 or tarsize + next_file_size > maxsize:
            hpss.py hpss_put(hpss, os.path.join(cache, tfname), cache, keep, non_blocking)
                hpss.py hpss_transfer(hpss, file_path, "put", cache, keep, non_blocking)
                    # 3. Transfer_a_tar_file
                    globus.py globus_transfer(endpoint, url_path, name, transfer_type, non_blocking)
                        # 4. Add tar_file to TransferData items (create if needed)
                        # (the tar file name is contained in src_path and dst_path)
                        transfer_data.add_item(src_path, dst_path)
                        if task_id:
                            task = transfer_client.get_task(task_id)
                            # 5. if a Transfer is ACTIVE
                            if task["status"] == "ACTIVE":
                                # 6. return to Make another tar file
                                # (moving on to file i+1)
                                return "ACTIVE"
                        # 7. else
                        # 8. Start a transfer (dir of tar-files)
                        task = submit_transfer_with_checks(transfer_data)
                        task_id = task.get("task_id")
                        transfer_data = None
                        # 9. If BLOCKING
                        if not non_blocking:
                            # 10. Try to wait for transfer completion
                            while retry_count < max_retries:
                                transfer_client.task_wait(task_id, timeout=wait_timeout, polling_interval=1)
                                if task_status == "SUCCEEDED":
                                    break
                        # 11. return Task_Status
                        return task_status
                    # 12. Check Task_Status
                    if not keep and scheme == "globus" and globus_status == "SUCCEEDED" and not non_blocking:
                        os.remove(file_path)
            create_new_tar = True

Example code walk-throughs

Say we have 8 equally sized files. 3 files fit into a tar of maxsize.

Non-blocking
Process file1
    Create tar1
    Add file1 to tar1
Process file2
    Add file2 to tar1
Process file3
    Add file3 to tar1
    # We've reached the threshold to submit a tar
    Add tar1 to list of items to transfer
    # task_id does not yet exist
    Submit transfer of list of items to transfer # [tar1 (containing files 1-3)]
Process file4
    Create tar2
    Add file4 to tar2
Process file5
    Add file5 to tar2
Process file6
    Add file6 to tar2
    # We've reached the threshold to submit a tar
    Add tar2 to list of items to transfer
    # task_id exists because tar1 was submitted
    # task_id is ACTIVE, because tar1 is still transferring
    So, we percolate back up the call stack, all the way to the next file in the add_files loop
Process file7
    Create tar3
    Add file7 to tar3
    # We haven't reached the threshold to submit a tar
    # (We have with tar2, but we only care about tar3 now!)
Process file8
    Add file8 to tar3
    # We haven't reached the threshold to submit tar3, 
    # BUT we have hit the last file
    # task_id exists because tar1 was submitted
    Case 1:
        # task_id is ACTIVE, because tar1 is still transferring
        So, we percolate back up the call stack, all the way to the next file in the add_files loop
        # Of course, this was the last file, so we get back to work in create.py
        # (But the global variable transfer_data reminds us we still have data to transfer)
        # We `hpss_put` the index.db
        # And then `globus_finalize` does the final submission,
        # regardless of if the previous transfer is still going!!
        # In that case, there would be two simultaneous Globus transfers.
        Submit transfer of list of items to transfer # [tar2 (containing files 4-6), tar3 (containing files 7-8)]
    Case 2:
        # task_id is SUCCEEDED, because tar1 has finished transferring
        Submit transfer of list of items to transfer # [tar2 (containing files 4-6), tar3 (containing files 7-8)]
        # From there, we can get back to work in create.py
        # We `hpss_put` the index.db
        # And then `globus_finalize` does nothing, because we already did the final transfer
Blocking

For reference, the relevant code block from above is:

                        # 9. If BLOCKING
                        if not non_blocking:
                            # 10. Try to wait for transfer completion
                            while retry_count < max_retries:
                                transfer_client.task_wait(task_id, timeout=wait_timeout, polling_interval=1)
                                if task_status == "SUCCEEDED":
                                    break
Process file1
    Create tar1
    Add file1 to tar1
Process file2
    Add file2 to tar1
Process file3
    Add file3 to tar1
    # We've reached the threshold to submit a tar
    Add tar1 to list of items to transfer
    # task_id does not yet exist
    Submit transfer of list of items to transfer # [tar1 (containing files 1-3)]
    # We are in the blocking case!
    # JUST KEEP WAITING, until we hit max_retries
    Case 1:
        # Transfer did complete
        # We move on with task["status"] == "SUCCEEDED"
    Case 2:
        # Transfer did not complete
        # We move on with task["status"] == "ACTIVE"
        # At this point, we have artificially entered non-blocking behavior
        # I.e., just go up to the earlier code block I have for non-blocking
Process file4
    Create tar2
    Add file4 to tar2
Process file5
    Add file5 to tar2
Process file6
    Add file6 to tar2
    # We've reached the threshold to submit a tar
    Add tar2 to list of items to transfer
    # task_id exists because tar1 was submitted
    # task_id is SUCCEEDED, because we blocked until tar1 finished transferring
    Submit transfer of list of items to transfer # [tar2 (containing files 4-6)]
    # We are in the blocking case!
    # Same thing as above for tar1
Process file7
    Create tar3
    Add file7 to tar3
Process file8
    Add file8 to tar3
    # We haven't reached the threshold to submit tar3, 
    # BUT we have hit the last file
    # task_id exists because tar2 was submitted
    # task_id is SUCCEEDED, because we blocked until tar2 finished transferring
    Submit transfer of list of items to transfer # [tar3 (containing files 7-8)]
    # We are in the blocking case!
    # Same thing as above for tar1
    # From there, We can get back to work in create.py
    # We `hpss_put` the index.db
    # And then `globus_finalize` does nothing, because we already did the final transfer

Tony, do I have the logic of the two example walk-throughs correct?

Testing

From the above, it appears the 3 major test cases would be:

  • Non-blocking: transfer one tar, build up a list of multiple tars while that tar is transferring
    • that first transfer is finished before we do the final transfer => immediately do last transfer (of multiple tars) and globus_finalize doesn't have to do anything
    • that first transfer is still going once we've added all the files to tars => use globus_finalize to do the last transfer (meaning there would be a second Globus transfer in parallel)
  • Blocking: transfer one tar, transfer another tar, transfer another tar

We should add an integration test for this too, but that will require some thought in order to make sure we actually hit the conditions. I'm hoping we could do that without an exorbitant amount of testing data. The trickier thing, however, is that a lot depends on timing, which is very challenging to test. I suppose proper use of sleep commands could work, but even then I don't think we could introduce that to the Globus transfer times.
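
One way to hit those timing-dependent branches deterministically, without real transfers or sleeps, might be to mock the status checks. A minimal sketch, assuming zstash.globus exposes the module-level transfer_client seen in the code above (exact names are whatever the PR landed with):

from unittest import mock

import zstash.globus

# Force the "previous task still ACTIVE" branch: get_task() reports ACTIVE,
# so globus_transfer() should return early instead of waiting or submitting.
fake_task = {"status": "ACTIVE", "task_id": "fake-task-id"}
with mock.patch.object(zstash.globus, "transfer_client") as mock_client:
    mock_client.get_task.return_value = fake_task
    # ...drive add_files()/globus_transfer() here and assert on the result...

Swapping "ACTIVE" for "SUCCEEDED" would exercise the other case without any dependence on actual Globus transfer times.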

Other notes

I think it's good to include the extra logging. We can just set the in-depth logging lines to be debug-level.

I've also included a few in-line comments as part of this review.

zstash/create.py Outdated
globus_activate(hpss)
else:
# config.hpss is not "none", so we need to
# create target HPSS directory
logger.debug("Creating target HPSS directory")
logger.debug(f"{ts_utc()}:Creating target HPSS directory {hpss}")
Collaborator:

Add space after : for consistency.

Collaborator (Author):

@forsyth2 I expect to make many edits such as this before full acceptance. My method is to make the changes in my local (chrysalis) repo and commit/push them, rather than make the changes in the web/ui as that confuses me. For instance, I added "DIVING" and "SURFACE" as log-terms just before and after a major function call, so I could be sure when and where processing was occurring, and I expect to remove those terms.

Collaborator:

My method is to make the changes in my local (chrysalis) repo and commit/push them, rather than make the changes in the web/ui

Agreed, I like that method better too.

zstash/globus.py Outdated
)
)
else:
logger.error("Transfer FAILED")
logger.error(f"{ts_utc()}: Transfer FAILED (task_id = {task_id})")
Collaborator:

I would consider updating this error logging to include what the actual task["status"] was, because it's not necessarily "FAILED"

Collaborator (Author):

Agreed! Will modify it to issue the status.
NOTE: An issue with this entire code-block is that we may or may not return, may or may not retain the previous task_id, etc. Logging needs to clarify which is which.

zstash/globus.py Outdated
wait_timeout = 5
max_retries = 3
retry_count = 0
while retry_count < max_retries:
Collaborator:

I don't really understand this part. So, we just give up after max_retries * wait_timeout = 3*5 = 15 seconds? And pretend like we wanted to be non-blocking? Don't Globus transfers typically take hours?

Collaborator (Author):

These values were made artificially small for test purposes (rather than incur massive test delays). You don't see it here, but in create.py I had changed MAXSIZE from n GB to n×100,000 bytes, to force multiple tarfile generations quickly.

If you don't use a "while ... retry < max_retries" loop, and the transfer hangs, then zstash hangs. You want to avoid that.

I'll provide more reasonable values here.

Collaborator:

Oh I see, that makes sense, thanks.

@TonyB9000 (Collaborator, Author) commented Dec 20, 2024

@forsyth2 @golaz I addressed Ryan's comments (I think), but also decided to make a function for the new "globus_block_wait()", to help clarify the code. Retested successfully in both blocking and non-blocking modes. Review should go quickly.

Feel free to suggest additional edits, etc.
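
For context, a blocking wait wrapped as a function might look roughly like the sketch below. This is an illustration only, not the code as merged; the parameter defaults follow the values discussed above:

def globus_block_wait(
    transfer_client, task_id, wait_timeout=7200, polling_interval=10, max_retries=3
):
    retry_count = 0
    task_status = "UNKNOWN"
    while retry_count < max_retries:
        # task_wait() returns once the task completes or wait_timeout elapses.
        transfer_client.task_wait(
            task_id, timeout=wait_timeout, polling_interval=polling_interval
        )
        task_status = transfer_client.get_task(task_id)["status"]
        if task_status == "SUCCEEDED":
            break
        retry_count += 1
    # If max_retries is exhausted, we fall through with a non-terminal status,
    # effectively dropping into non-blocking behavior (the open abort-Y/N question).
    return task_status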

@forsyth2 (Collaborator) left a comment

@TonyB9000 I added a new commit (633a543) that addresses the pre-commit failures. I also ran the unit tests but am still getting some failures that I need to debug.

To try these things yourself, run:

# Make sure everything for Globus is working
cd <some directory>
mkdir zstash_demo; echo 'file0 stuff' > zstash_demo/file0.txt
rm ~/.globus-native-apps.cfg
# Load Unified, so we can use a known-working version of zstash
source /global/common/software/e3sm/anaconda_envs/load_latest_e3sm_unified_pm-cpu.sh
zstash create --hpss=globus://6c54cade-bde5-45c1-bdea-f4bd71dba2cc/~/manual_run zstash_demo
# Paste the two auth codes

# Set up conda dev environment
cd <zstash-clone>
conda clean --all --y
conda env create -f conda/dev.yml -n zstash_pr355
conda activate zstash_pr355

# Run pre-commit checks
pre-commit run --all-files
# Keep running this until it auto-fixes everything or gives you actionable requests.

# Run the unit tests
pip install .
python -u -m unittest tests/test_*.py 
# I'm currently getting `Ran 69 tests`, `failures=34`

We should include the test cases in the PR. If it's too difficult to convert into the unit test framework, at least adding a script to run manually would be useful.

zstash/globus.py Outdated
Comment on lines 316 to 317
# except GlobusHTTPError as e:
# logger.error(f"Exception: {e}")
Collaborator:

This error doesn't seem to exist in any package I can find. Is it a typo?

Collaborator (Author):

It was likely borrowed from the interwebs as I crafted that whole try-block. It was the only "exception" listed, which I thought was weird (what about other exceptions?). I would be happy to eliminate it, and go with the generic "any" exception I added from seeing other examples.
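
For what it's worth, a generic pattern using an exception class that does exist in globus-sdk might look like the following sketch (assumes globus-sdk v3, with transfer_client, transfer_data, and logger in scope):

import globus_sdk

try:
    task = transfer_client.submit_transfer(transfer_data)
except globus_sdk.TransferAPIError as e:
    # Raised for error responses from the Globus Transfer service.
    logger.error(f"TransferAPIError: {e}")
    raise
except Exception as e:
    # Generic catch-all for anything unexpected.
    logger.error(f"Exception: {e}")
    raise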

zstash/hpss.py Outdated
Comment on lines 114 to 124
if (
not keep
and scheme == "globus"
and globus_status == "SUCCEEDED"
and not non_blocking
):
# We should not keep the local file, so delete it now that it is on HPSS
os.remove(file_path)
if not keep and scheme != "globus":
# We should not keep the local file, so delete it now that it is on HPSS
os.remove(file_path)
Collaborator:

Consider changing the logic to the following to make it somewhat easier to follow:

if not keep:
  if scheme != "globus":
    os.remove(file_path)
  elif globus_status == "SUCCEEDED" and not non_blocking:
    os.remove(file_path)

Collaborator:

or alternatively make use of short-circuit evaluation:

if not keep:
  if (scheme != "globus") or (globus_status == "SUCCEEDED" and not non_blocking):
    os.remove(file_path)

@TonyB9000 (Collaborator, Author) commented Jan 2, 2025

Yes, the only reason I kept it as I did was that the second "if" statement was there originally, and I was not sure if it was a good idea to reduce the logic (in case my preceding clause needed to be backed out). I will consolidate.

(ACK! chrysalis tells me "disk quota exceeded" - I cannot save anything. ?!?)

Collaborator (Author):

OK (have space, will travel). I hope I got this correct. My first try to push suggested changes demanded I first pull down your latest commits. I did so - but had to negotiate a merge/conflict (never sure when to use rebase/fast-forward/etc).

Collaborator:

Thanks @TonyB9000

demanded I first pull down your latest commits. I did so - but had to negotiate a merge/conflict (never sure when to use rebase/fast-forward/etc).

I generally use git rebase rather than git pull

git fetch upstream branch-name
git rebase upstream/branch-name
git status
# Merge conflicts will be listed; fix those
git add -A # or just add the changed files individually
git rebase --continue

If you have a lot of commits, this may need to be repeated several times. If the commit history is not very important (it's usually not for commits on a single PR), you can change the history to be just a single commit:

git log
git rebase -i <hash of last commit you don't want to roll into one>
# Keep the `p` (for pick) for the first commit listed
# Replace remaining `p` with `f` (for fixup -- this will roll the commit into the one above it)
# Now, run the code block above

Collaborator (Author):

@forsyth2 Thanks Ryan. I'll add the rebase sequence to my personal "git how-to" page.

@forsyth2 (Collaborator) commented Jan 7, 2025

@TonyB9000 The unit test failures I was running into turned out to have a simple fix. I just pushed the commit; the expected logging statements were just different because of the added timestamps.

The unit tests now pass, but I still think we should have some sort of integration test or at the very least, a README or shell script to test manually. I.e., would it be possible to add a script/directions of how you've been testing this issue? Thanks!

Other than that, this is probably good to merge.

@golaz (Collaborator) left a comment

Approving based on discussion during meetings and cursory review of the code changes. I will test under real world conditions as part of the next RC of e3sm unified.

@TonyB9000 marked this pull request as ready for review January 8, 2025 19:47
@TonyB9000 (Collaborator, Author):

@golaz I am merging to master, and included a "tests3" directory with my test regime and instructions. Please let me know if your field testing uncovers anything unexpected.

@TonyB9000 merged commit 10a5b8f into main Jan 8, 2025
2 of 3 checks passed
@forsyth2 (Collaborator) commented Jan 8, 2025

@TonyB9000 How did you merge this to main? Did you do "Squash and Merge" on the PR, or git merge from the command line?

[screenshot: main's commit history]
The commit history now includes multiple commits from this PR.

We expect more like this:
[screenshot: expected commit history]

@forsyth2 (Collaborator) commented Jan 8, 2025

@tomvothecoder (or maybe @xylar) Re: comment above (#355 (comment)), I clicked the "Revert" button, but that just opens a PR to merge a commit undoing all the changes -- i.e. it doesn't change the commit history.

How can we clean up the commit history? I'm thinking the following?

git fetch upstream
git checkout -b main upstream/main

# Rebase off the latest wanted commit ("Fix pre-commit error" on https://github.com/E3SM-Project/zstash/commits/main/)
git rebase -i c9ea93737b0019c1e679344199dd16d3b6f3adff
# In the rebase, do `d` to drop the merge commit.
# Will that also drop all the other commits it merged in????

# Force push changes
git push -f upstream main

# Re-merge the changes from this PR with "Squash and Merge"

@forsyth2 (Collaborator) commented Jan 8, 2025

i.e. it doesn't change the commit history.

According to https://docs.github.com/en/desktop/managing-commits/reverting-a-commit-in-github-desktop, this is true: "When you revert to a previous commit, the revert is also a commit. The original commit also remains in the repository's history."

That's not what we want; we want a commit history without all the minor commits, only the commit from the merge of this PR.

@forsyth2 (Collaborator) commented Jan 8, 2025

I'm not really sure how this happened, considering we have this set:
[screenshot: repository settings]

We also have "Squash and Merge" set as the only possible way to merge a PR:
[screenshot: "Squash and Merge" as the only enabled merge option]

@TonyB9000 (Collaborator, Author):

@forsyth2 I applied the "merge" on the web-page, not from the command line.

First, from chrysalis:

git fetch origin
git rebase origin/non-block-testing

to incorporate your latest changes. Then

git push origin non-block-testing

to push my test-regime files (induces no conflicts). (My screens look different than yours; I'm on Firefox with dark mode.)

The instructions you provided did not work, BTW. It did not like anything with "upstream".

@xylar (Contributor) commented Jan 10, 2025

in that case I'd say the work should be multiple pull requests as well, though.

Here, I mostly disagree. Typically, a feature isn't fully testable until it is complete and a partially tested, incomplete feature should not be merged.

@forsyth2 (Collaborator):

bite-sized change that is understandable on its own

I see where you're coming from. I think my main issue here is that pull requests, at least for me, are almost never developed in a logical order. It's "got piece X working, got piece Y working, oh here's an improvement to X, here's piece Z, and another improvement for piece Y" -- it's not "Ok, X works. Ok now Y works. Ok now Z works." And that's why, for me, just making one commit that's "X,Y,Z all work now" often makes more sense.

For the zppy/zppy-interfaces refactor, for example, the commits on https://github.com/E3SM-Project/zppy/pull/642/commits and https://github.com/E3SM-Project/zppy-interfaces/pull/1/commits were squashed into single commits on their respective main branches, which I think makes sense because these commit histories show more the haphazard nature of development than they do discrete, logical units of work.

In any case, these sorts of design decisions seem like something that should be discussed more and settled; I can bring these points up with @tomvothecoder @chengzhuzhang and @golaz in our next meeting.

@xylar (Contributor) commented Jan 10, 2025

I think that argues for small commits with a distinct purpose that touch only one or a few files, together with frequent fix-ups that get squashed into the appropriate original commits using rebasing.

Squashing a bunch of commits into a single one because they are messy just creates a single messy commit in my view.
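
For reference, the fix-up-then-squash workflow described above is directly supported by standard git:

# Mark a small fix as belonging to an earlier commit:
git commit --fixup <hash-of-original-commit>
# Later, have git reorder and squash the fixups automatically:
git rebase -i --autosquash <base-commit>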

@forsyth2 (Collaborator):

That seems like it could be a good plan! I can try that going forward

@TonyB9000 (Collaborator, Author):

@forsyth2 I just pushed a small update to the tests3 README file (fixed several typos). Should I ask for a review? Do a PR? (it is on the same branch that was recently merged.)

@tomvothecoder (Collaborator) commented Jan 11, 2025

Both squash and commit and merge and commit have their own advantages and disadvantages. I think the most important thing is how it is done by the contributors and maintainers. Squash and commit for massive PRs might not be ideal, while merge and commit with a ton of messy commits is also not ideal (it requires standardized practice).

I still like using squash and commit with xCDAT because we make smaller, well-defined features and bug fixes at this point in its lifecycle. One thing to note is that PRs will still have all of the commits even after the squash and commit (unless you delete the related branch). However, if you manually squash via git and force push before merging, you will lose all of the history (I don't do this anymore).


In any case, here's ChatGPT's response. Squash and commit vs. merge and commit via ChatGPT:

The choice between squash and commit and merge and commit often depends on the development workflow and the goals of your version control strategy. Here's a breakdown of the two:


Squash and Commit

  • What It Does: Combines all the commits from a feature branch into a single commit before adding it to the target branch (e.g., main or develop).
  • Use Case: Ideal for creating a clean, concise history in the target branch.
  • Advantages:
    • Results in a simpler commit history.
    • Makes it easier to review and understand changes as one cohesive unit.
    • Useful for small, focused feature branches or bug fixes.
  • Disadvantages:
    • Loses the granular history of the branch (e.g., step-by-step progress and debugging information).
    • Harder to debug if a problem arises later, as you can’t easily track individual steps.

When to Use:

  • When the branch contains many small, incremental, or experimental commits that don’t add significant value to the main history.
  • For repositories where clean commit history is prioritized.

Merge and Commit

  • What It Does: Combines the feature branch into the target branch with all its commits preserved. Often creates a merge commit to record the merge itself.
  • Use Case: Suitable for retaining the detailed history of changes made in the feature branch.
  • Advantages:
    • Preserves the complete commit history, including all intermediate commits.
    • Useful for complex changes where each commit is meaningful (e.g., multiple logical steps in a large feature).
    • Easier to debug because you can see the progression of changes over time.
  • Disadvantages:
    • Can result in a cluttered or noisy commit history.
    • Reviewers might need to sift through many small commits.

When to Use:

  • For long-running feature branches where each commit represents a meaningful step.
  • In repositories where preserving detailed commit history is important for auditing or debugging.

Choosing the Right Approach

  • Team's Workflow: Align with your team's Git workflow (e.g., GitHub Flow, GitLab Flow, or trunk-based development).
  • Project Size and Complexity: For large projects with multiple contributors, preserving detailed commit history might be crucial.
  • Repository Purpose:
    • For libraries or open-source projects, clean history (via squash) may be preferred.
    • For internal tools or experimentation, detailed history (via merge) might be more valuable.

Common Scenarios

  • Small feature or bug fix: Squash and commit to keep the main branch tidy.
  • Large feature or refactor: Merge and commit to preserve the full history.
  • Code Review Platforms: Some platforms allow you to choose during a pull/merge request, giving flexibility based on the context.

Would you like help with setting up a workflow or exploring more about either approach?

@tomvothecoder (Collaborator) commented Jan 11, 2025

Also, I used merge and commit for a large refactor effort before, and we would manually go through the git history to squash things into logical groups before merging. I think if contributors have enough discipline on a small team, this can work nicely. However, it's probably a ton more work for larger open-source projects where people's commit habits are all different (e.g., Xarray uses squash and commit).

@xylar (Contributor) commented Jan 11, 2025

@tomvothecoder, thanks for sharing. Interesting thoughts.

However, it's probably a ton more work for larger open-source projects where people's commit habits are all different (e.g., Xarray uses squash and commit).

I feel like exactly such an environment is potentially where squashing commits causes a debugging nightmare if a subtle bug is introduced. E3SM itself is a place where squash commits would be a very bad idea, and that is in part because of the diverse habits of the developers.

@forsyth2 (Collaborator) commented Jan 13, 2025

@tomvothecoder Thanks, this is really helpful.

I'm also copying the notes I put for today's team meeting:

Up until now, my general workflow has been

  • Push commits to the feature branch as I get little pieces working.
  • Merge using the Squash & Merge option on GitHub. (I had this set as the default, at least for me, but Tony’s default was the merge commit option. I then changed the repo settings to require Squash & Merge)
  • The end result is that main had very few commits – each commit having a 1:1 correspondence with a PR. Meanwhile, the commit history on the PR was more thorough. That said, I often squashed commits even on the feature branch, for conciseness.

On this thread, came up with this plan:

  • On a feature branch, make lots of tiny, incremental commits.
  • Before merging the PR, use git rebase -i to squash/fixup commits so that they form logical units of work (e.g., a single commit for “add plots” even if that work was done in several non-contiguous commits)
    • Concern 1: If we have commits A, B, C, and we’re trying to squash C into A, I could then see conflicts occurring when B is applied.
  • Merge the PR with a merge commit (i.e., not squash & merge). This will keep the commits representing logical units of work in the commit history.
    • Concern 2: That’s reasonable if there’s not concurrent work; but if we merge two PRs with lots of commits at around the same time, the commit history is going to be very intertwined. That is, the commit history is no longer linear.

Other notes on commit histories:

  • Does main’s commit history represent a concise story of its development or an engineering log of everything that ever happened?
  • Most importantly, should we force push to zstash’s main to keep a linear commit history or leave the merge commit and keep using merge commits going forward?
  • Benefit of a linear history: easier to see the big picture of what’s been changed (and by extension, easier to write release notes)
  • Benefit of merge commits: more fine-grained changes, easier to understand a smaller chunk of code at a time

I just pushed a small update to the tests3 README file (fixed several typos). Should I ask for a review? Do a PR? (it is on the same branch that was recently merged.)

@TonyB9000 Definitely do a PR. Wait to merge until we figure out how to proceed with the commit history. I'm planning to discuss with @chengzhuzhang and @golaz, but I'm still inclined to squash the commits on main, especially to be consistent with previous commits for this release.

@xylar (Contributor) commented Jan 13, 2025

Concern 1: If we have commits A, B, C, and we’re trying to squash C into A, I could then see conflicts occurring when B is applied.

This is why it is better to do git rebase -i often and not to try to intertwine unrelated development on the same file if you can avoid it. If things have got tangled up, it is time to allow yourself to have 2 commits with similar purpose, e.g. add plots and update plots.

Concern 2: That’s reasonable if there’s not concurrent work; but if we merge two PRs with lots of commits at around the same time, the commit history is going to be very intertwined. That is, the commit history is no longer linear.

This is not really a bad thing. It is one of the huge improvements in git over its predecessors like svn. If you have trouble looking at commit history because it's not linear, I recommend adding this alias to your .gitconfig:

[alias]
        logg = log --graph --oneline --decorate

With git logg, you will see in some pretty clear ASCII art how the commits work together.

The goal should be a clean development workflow, not a clean git history. I still think the squash merge as a policy is too blunt an instrument and sacrifices a clean workflow.

@xylar (Contributor) commented Jan 13, 2025

Here's an example of what that can look like for overlapping feature development:
[screenshot: git commit graph with color-coded branch lines]
You can follow the color-coded lines to see where a feature branch started from, where its commits were added, and where it was merged back.

@forsyth2 (Collaborator):

If things have got tangled up, it is time to allow yourself to have 2 commits with similar purpose, e.g. add plots and update plots.

That makes sense to me if we're doing finer-grained commits.

Ok, git logg does look nicer than how https://github.com/E3SM-Project/zstash/commits/main shows it. It would be nice if the GitHub web interface did something similar.

*   10a5b8f (HEAD -> main, upstream/main, upstream/HEAD, main_20250110) Merge pull request #355 from E3SM-Project/non-block-testing
|\  
| * 4a2a8df added tests3 scripts and README for blocking tests
| * c9ea937 (upstream/non-block-testing, non-block-testing) Fix pre-commit error
| * 7ed1f92 Unit tests passing
| *   5aef52d recommit after pull/merge of Ryan's changes
| |\  
| | * 633a543 (non-block-testing-before-20250106) Fix pre-commit checks and run tests
| * | 5738ff2 minor cleanup
| |/  
| * d7bae50 datetime and timezone unneeded here (ts_utc in util), and corrected comment in hpss
| * 0540e93 Made globus_block_wait a function, clarified log messages, revert parameters
| * 35b98d4 Passed non_blocking variable down call-chain, implemented loop for task_wait on blocking, force globus_transfer to return a sting status, added more logging.
| * f48a372 Passed non_blocking variable down call-chain, added more logging.
| * 969a7c1 Passed non_blocking variable down call-chain, implemented cache-delete-file when blocking success, added more logging.
| * 768e3a9 Added more log messaging
| * a434a36 added ts_utc() function for convenient UTC timestamps
|/  
* 082afc7 Update pre-commit dependencies (#353)
* ef8372a Tweak behaviors of build and release CI workflows (#352)
* 18c84f1 Update supported python versions (#351)
* 74fcc84 Add symlink tests and refactor for new tests (#348)
* 98e6ae1 Add PR template (#347)
* 6894798 (test-post-343) Fix symlink behavior (#343)
* ef87eaf Add DOI badge (#342)

@xylar Re: plan going forward -- so your strong preference would be 1) leave the merge commit from this PR as-is, 2) use merge commits exclusively going forward. Is that correct?

@forsyth2 (Collaborator):

Oh it occurs to me my other big problem with having lots of commits is if main changes significantly. It's way easier to squash the commits on a feature branch into one, and then address merge conflicts all at once, rather than iteratively through n commits.
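
One way to do that collapse-then-rebase with standard git (a sketch, assuming the feature branch forked from main):

git checkout feature-branch
# Move the branch pointer back to the fork point, keeping all changes staged:
git reset --soft $(git merge-base main HEAD)
git commit -m "Consolidated feature commit"
# Conflicts now surface once, against a single commit:
git rebase main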

@xylar (Contributor) commented Jan 13, 2025

@xylar Re: plan going forward -- so your strong preference would be 1) leave the merge commit from this PR as-is, 2) use merge commits exclusively going forward. Is that correct?

My preference would just be not to have a policy of using exclusively squash merge. I wanted to argue for why it should be an option to use merge commits. You can still use squash-merge when not too much info is lost in the process.

Oh it occurs to me my other big problem with having lots of commits is if main changes significantly. It's way easier to squash the commits on a feature branch into one, and then address merge conflicts all at once, rather than iteratively through n commits.

Ah, you've got me there. I do try to rebase first to squash as many commits as I reasonably can but then the process of rebasing onto main can be painful. Sometimes that's when I throw up my hands and start a new branch where I copy over my changes manually.

@xylar (Contributor) commented Jan 13, 2025

It would be nice if the GitHub web interface did something similar.

Apparently it can be done:
https://chromewebstore.google.com/detail/le-git-graph-commits-grap/joggkdfebigddmaagckekihhfncdobff

@forsyth2 (Collaborator):

not to have a policy of using exclusively squash merge

Ok so it seems like the general plan should be squash & merge for small changes, merge commits for large changes, but avoid strict policies. That seems reasonable to me. I can update the git branch protections accordingly.

I do try to rebase first to squash as many commits as I reasonably can but then the process of rebasing onto main can be painful.

Yeah I guess in this case it might make sense to use git rebase -i to at least shrink the number of commits to units of work rather than minimal changes like "add import here" or "fix spacing", so there are fewer, if not one, commits to address conflicts on.

Sometimes that's when I throw up my hands and start a new branch where I copy over my changes manually.

Oh wow! That sounds pretty tedious though.

@xylar (Contributor) commented Jan 13, 2025

The chrome extension works but I still like the command-line version better:
[screenshot: command-line git log graph]

@rljacob (Member) commented Jan 13, 2025

The main model uses merge commits. "git log --first-parent" gives you a linear commit history by showing you only the merge commits.

There are also standalone applications like GitUp that let you browse the branches graphically.

@xylar (Contributor) commented Jan 13, 2025

like Github

I guess that's Gitup, not Github?

@xylar (Contributor) commented Jan 13, 2025

One more situation where squash-merge is undesirable is when multiple people are involved in development. Some (or maybe even all!) lose credit in that process.

@forsyth2 (Collaborator):

One more situation where squash-merge is undesirable is when multiple people are involved in development. Some (or maybe even all!) lose credit in that process.

I've noticed on Squash & Merge, it will list everyone who authored any of the commits as an author, but you're right that which person did which work gets lost.

@xylar (Contributor) commented Jan 13, 2025

Ah, still, better than losing authors entirely.

@forsyth2 (Collaborator) commented Jan 13, 2025

@xylar @tomvothecoder Conclusions from meeting with @TonyB9000, @chengzhuzhang, @golaz:

  • Conclusion for this PR is to simply leave Tony’s merge commit as-is.
  • Going forward, commit type should be case-by-case, not standard policy. Individual developers can use their best judgment.
    • So I'll re-enable the other merge types on the repos then.
  • General guideline: PR is for adding a single feature, but a PR might be multiple commits if the feature has a number of distinct parts
    • Squashing commits on a feature branch is fine to consolidate coherent chunks of work.

Thanks for this discussion; it was very informative.

@TonyB9000 (Collaborator, Author):

@forsyth2 I pushed my update to the tests3/README documentation last week - but there is no "Do a PR" option where I usually find one (here). Perhaps this is because I pushed to a branch already merged? How to proceed?

@forsyth2 (Collaborator):

Perhaps this is because I pushed to a branch already merged?

That would be my guess

How to proceed?

From the command line:

# Make sure you're on the branch
git checkout non-block-testing

# Change the branch's name
git branch -m non-block-testing-fix

# Push the new branch name
git push origin non-block-testing-fix

@forsyth2 (Collaborator):

So I'll re-enable the other merge types on the repos then.

Actually I still think it might be good to disallow rebase & merge -- the other two at least have a very clear "this PR introduced this change" connection. The rebase & merge option adds commits individually which I think could get very messy.

[screenshot: pull request merge options]

@forsyth2 mentioned this pull request Jan 13, 2025
@xylar (Contributor) commented Jan 14, 2025

That's right, I don't think the rebase merge is a good idea.

Successfully merging this pull request may close these issues: zstash is always non-blocking