Non block testing #355

Merged: 13 commits merged into main from non-block-testing, Jan 8, 2025
Conversation

@TonyB9000 (Collaborator) commented Dec 19, 2024

Issue resolution

Properly implements Blocking and Non-Blocking behavior in archive creation/transfer.

task_wait timeout and polling_interval should be increased to 7200 and 10, respectively.
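
For reference, a minimal sketch of that call (assuming a globus_sdk TransferClient instance named transfer_client and an existing task_id; the values here are the proposed ones, not necessarily what was merged):

# task_wait() polls Globus every polling_interval seconds and returns True
# if the task completes within timeout seconds, False otherwise.
done = transfer_client.task_wait(task_id, timeout=7200, polling_interval=10)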

Still leaves an open-ended question about what to do when we return from "waiting too long" (after max_retries: abort, Y/N?).

Could be stress-tested with MUCH larger files, where weeks are required for transfers to complete.

Many added "logger.info" messages could be eliminated, or made "logger.debug" messages. All added messages use the f-string syntax

message = f"var = {var}"

rather than

message = "var = {}".format(var)

Required:

  • [maybe] Product Management: I have confirmed with the stakeholders that the objectives above are correct and complete.
  • [no] Testing: I have added at least one automated test. Every objective above is represented in at least one test.
  • [no] Testing: I have considered likely and/or severe edge cases and have included them in testing.

If applicable:

  • [no] Testing: this pull request adds at least one new possible command line option. I have tested using this option with and without any other option that may interact with it.

2. Are the implementation details accurate & efficient?

Required:

  • [yes] Logic: I have visually inspected the entire pull request myself.
  • [in_progress] Logic: I have left GitHub comments highlighting important pieces of code logic. I have had these code blocks reviewed by at least one other team member.

If applicable:

  • Dependencies: This pull request introduces a new dependency. I have discussed this requirement with at least one other team member. The dependency is noted in zstash/conda, not just an import statement.

3. Is this well documented?

Required:

  • [maybe] Documentation: by looking at the docs, a new user could easily understand the functionality introduced by this pull request.

4. Is this code clean?

Required:

  • [maybe] Readability: The code is as simple as possible and well-commented, such that a new team member could understand what's happening.
  • [no] Pre-commit checks: All the pre-commit checks have passed.

If applicable:

  • Software architecture: I have discussed relevant trade-offs in design decisions with at least one other team member. It is unlikely that this pull request will increase tech debt.

@TonyB9000 requested review from golaz and forsyth2, December 19, 2024 19:02
@TonyB9000 (Collaborator, Author):

(Wot a hassle!)

@forsyth2 (Collaborator) left a comment

Thanks @TonyB9000! This was quite a logically complex piece of zstash to work on.

My analysis of the call hierarchy

There's a lot of code to follow here, but I think I've finally managed to understand all the major pieces. I've included thorough notes on that below. (Notes below are primarily for myself, but they might be helpful to others).

Verbose call hierarchy
create()
    logger.debug(f"{ts_utc()}: Running zstash create")
    logger.debug(f"{ts_utc()}:Calling globus_activate(hpss)")
    globus_activate(hpss)
        # No changes in this PR
    logger.debug(f"{ts_utc()}: Creating local cache directory")
    logger.debug(f"{ts_utc()}: Calling create_database()")
    failures: List[str] = create_database(cache, args)
        logger.debug(f"{ts_utc()}:Creating index database")
        failures = add_files(..., non_blocking=args.non_blocking)
            def add_files(..., non_blocking: bool = False):
            logger.info(f"{ts_utc()}: Creating new tar archive {tfname}")
            if i == nfiles - 1 or tarsize + next_file_size > maxsize:
                logger.debug(f"{ts_utc()}: Closing tar archive {tfname}")
                logger.info(f"{ts_utc()}: (add_files): Completed archive file {tfname}")
                logger.info(f"{ts_utc()}: DIVING: (add_files): Calling hpss_put to dispatch archive file {tfname}")
                hpss_put(hpss, os.path.join(cache, tfname), cache, keep, non_blocking)
                    # See hpss_put a few lines down
                logger.info(f"{ts_utc()}: SURFACE (add_files): Called hpss_put to dispatch archive file {tfname}")
    logger.debug(f"{ts_utc()}: calling hpss_put() for {get_db_filename(cache)}")
    hpss_put(hpss, get_db_filename(cache), cache, keep=True)
        def hpss_put(..., non_blocking: bool = False):
        hpss_transfer(hpss, file_path, "put", cache, keep, non_blocking)
            def hpss_transfer(..., non_blocking: bool = False):
            globus_status = "UNKNOWN"
            logger.info(f"{ts_utc()}: DIVING: hpss calls globus_transfer(name={name})")
            globus_status = globus_transfer(endpoint, url_path, name, transfer_type, non_blocking)
                # Changed in this PR!!!
                def globus_transfer(..., non_blocking: bool):
                # Bulk of the changes in the PR are in this logic
            if not keep and scheme == "globus" and globus_status == "SUCCEEDED" and not non_blocking:
                # We should not keep the local file, so delete it now that it is on HPSS
                os.remove(file_path)
            logger.info(f"{ts_utc()}: SURFACE hpss globus_transfer(name={name}) returns")
    logger.debug(f"{ts_utc()}: calling globus_finalize()")
    globus_finalize(non_blocking=args.non_blocking)
        # No changes in this PR
Simplified call hierarchy
create()
    globus_activate(hpss)
        # No changes in this PR
    failures: List[str] = create_database(cache, args)
        failures = add_files(..., non_blocking=args.non_blocking)
            if i == nfiles - 1 or tarsize + next_file_size > maxsize:
                hpss_put(hpss, os.path.join(cache, tfname), cache, keep, non_blocking)
                    hpss_transfer(hpss, file_path, "put", cache, keep, non_blocking)
                        globus_status = "UNKNOWN"
                        globus_status = globus_transfer(endpoint, url_path, name, transfer_type, non_blocking)
                            # Changed in this PR!!!
                            # Bulk of the changes in the PR are in this logic
                        if not keep and scheme == "globus" and globus_status == "SUCCEEDED" and not non_blocking:
                            # We should not keep the local file, so delete it now that it is on HPSS
                            os.remove(file_path)
    hpss_put(hpss, get_db_filename(cache), cache, keep=True)
        # See hpss_put line above
        # This block adds the db file
    globus_finalize(non_blocking=args.non_blocking)
        # No changes in this PR
Tony's pseudo-code
1   While Files_to_archive
2       Make_a_tar_file (stuff while tar size < MAXSIZE)
3       Transfer_a_tar_file
4           Add tar_file to TransferData items (create if needed)
5           if a Transfer is ACTIVE
6               return to Make another tar file
7           else
8               Start a transfer (dir of tar-files)
9           If BLOCKING
10              Try to wait for transfer completion
11          return Task_Status
12      Check Task_Status
Matching the pseudo-code to the real code
hpss_utils.py add_files
    # 1. While Files_to_archive
    for i in range(nfiles):
        if create_new_tar:
            create_new_tar = False
            # 2. Make_a_tar_file (stuff while tar size < MAXSIZE)
            tar = tarfile.open(mode="w", fileobj=tarFileObject, dereference=follow_symlinks) 
        hpss_utils.py add_file(tar, current_file, follow_symlinks)
        tarsize = tarFileObject.tell()
        if i == nfiles - 1 or tarsize + next_file_size > maxsize:
            hpss.py hpss_put(hpss, os.path.join(cache, tfname), cache, keep, non_blocking)
                hpss.py hpss_transfer(hpss, file_path, "put", cache, keep, non_blocking)
                    # 3. Transfer_a_tar_file
                    globus.py globus_transfer(endpoint, url_path, name, transfer_type, non_blocking)
                        # 4. Add tar_file to TransferData items (create if needed)
                        # (the tar file name is contained in src_path and dst_path)
                        transfer_data.add_item(src_path, dst_path)
                        if task_id:
                            task = transfer_client.get_task(task_id)
                            # 5. if a Transfer is ACTIVE
                            if task["status"] == "ACTIVE":
                                # 6. return to Make another tar file
                                # (moving on to file i+1)
                                return "ACTIVE"
                        # 7. else
                        # 8. Start a transfer (dir of tar-files)
                        task = submit_transfer_with_checks(transfer_data)
                        task_id = task.get("task_id")
                        transfer_data = None
                        # 9. If BLOCKING
                        if not non_blocking:
                            # 10. Try to wait for transfer completion
                            while retry_count < max_retries:
                                transfer_client.task_wait(task_id, timeout=wait_timeout, polling_interval=1)
                                if task_status == "SUCCEEDED":
                                    break
                        # 11. return Task_Status
                        return task_status
                    # 12. Check Task_Status
                    if not keep and scheme == "globus" and globus_status == "SUCCEEDED" and not non_blocking:
                        os.remove(file_path)
            create_new_tar = True

Example code walk-throughs

Say we have 8 equally sized files. 3 files fit into a tar of maxsize.

Non-blocking
Process file1
    Create tar1
    Add file1 to tar1
Process file2
    Add file2 to tar1
Process file3
    Add file3 to tar1
    # We've reached the threshold to submit a tar
    Add tar1 to list of items to transfer
    # task_id does not yet exist
    Submit transfer of list of items to transfer # [tar1 (containing files 1-3)]
Process file4
    Create tar2
    Add file4 to tar2
Process file5
    Add file5 to tar2
Process file6
    Add file6 to tar2
    # We've reached the threshold to submit a tar
    Add tar2 to list of items to transfer
    # task_id exists because tar1 was submitted
    # task_id is ACTIVE, because tar1 is still transferring
    So, we percolate back up the call stack, all the way to the next file in the add_files loop
Process file7
    Create tar3
    Add file7 to tar3
    # We haven't reached the threshold to submit a tar
    # (We have with tar2, but we only care about tar3 now!)
Process file8
    Add file8 to tar3
    # We haven't reached the threshold to submit tar3, 
    # BUT we have hit the last file
    # task_id exists because tar1 was submitted
    Case 1:
        # task_id is ACTIVE, because tar1 is still transferring
        So, we percolate back up the call stack, all the way to the next file in the add_files loop
        # Of course, this was the last file, so we get back to work in create.py
        # (But the global variable transfer_data reminds us we still have data to transfer)
        # We `hpss_put` the index.db
        # And then `globus_finalize` does the final submission,
        # regardless of if the previous transfer is still going!!
        # In that case, there would be two simultaneous Globus transfers.
        Submit transfer of list of items to transfer # [tar2 (containing files 4-6), tar3 (containing files 7-8)]
    Case 2:
        # task_id is SUCCEEDED, because tar1 has finished transferring
        Submit transfer of list of items to transfer # [tar2 (containing files 4-6), tar3 (containing files 7-8)]
        # From there, we can get back to work in create.py
        # We `hpss_put` the index.db
        # And then `globus_finalize` does nothing, because we already did the final transfer
Blocking

For reference, the relevant code block from above is:

                        # 9. If BLOCKING
                        if not non_blocking:
                            # 10. Try to wait for transfer completion
                            while retry_count < max_retries:
                                transfer_client.task_wait(task_id, timeout=wait_timeout, polling_interval=1)
                                if task_status == "SUCCEEDED":
                                    break
Process file1
    Create tar1
    Add file1 to tar1
Process file2
    Add file2 to tar1
Process file3
    Add file3 to tar1
    # We've reached the threshold to submit a tar
    Add tar1 to list of items to transfer
    # task_id does not yet exist
    Submit transfer of list of items to transfer # [tar1 (containing files 1-3)]
    # We are in the blocking case!
    # JUST KEEP WAITING, until we hit max_retries
    Case 1:
        # Transfer did complete
        # We move on with task["status"] == "SUCCEEDED"
    Case 2:
        # Transfer did not complete
        # We move on with task["status"] == "ACTIVE"
        # At this point, we have artificially entered non-blocking behavior
        # I.e., just go up to the earlier code block I have for non-blocking
Process file4
    Create tar2
    Add file4 to tar2
Process file5
    Add file5 to tar2
Process file6
    Add file6 to tar2
    # We've reached the threshold to submit a tar
    Add tar2 to list of items to transfer
    # task_id exists because tar1 was submitted
    # task_id is SUCCEEDED, because we blocked until tar1 finished transferring
    Submit transfer of list of items to transfer # [tar2 (containing files 4-6)]
    # We are in the blocking case!
    # Same thing as above for tar1
Process file7
    Create tar3
    Add file7 to tar3
Process file8
    Add file8 to tar3
    # We haven't reached the threshold to submit tar3, 
    # BUT we have hit the last file
    # task_id exists because tar2 was submitted
    # task_id is SUCCEEDED, because we blocked until tar2 finished transferring
    Submit transfer of list of items to transfer # [tar3 (containing files 7-8)]
    # We are in the blocking case!
    # Same thing as above for tar1
    # From there, We can get back to work in create.py
    # We `hpss_put` the index.db
    # And then `globus_finalize` does nothing, because we already did the final transfer

Tony, do I have the logic of the two example walk-throughs correct?

Testing

From the above, it appears the 3 major test cases would be:

  • Non-blocking: transfer one tar, build up a list of multiple tars while that tar is transferring
    • that first transfer is finished before we do the final transfer => immediately do last transfer (of multiple tars) and globus_finalize doesn't have to do anything
    • that first transfer is still going once we've added all the files to tars => use globus_finalize to do the last transfer (meaning there would be a second Globus transfer in parallel)
  • Blocking: transfer one tar, transfer another tar, transfer another tar

We should add an integration test for this too, but that will require some thought in order to make sure we actually hit the conditions. I'm hoping we could do that without an exorbitant amount of testing data. The trickier thing, however, is that a lot depends on timing, which is very challenging to test. I suppose proper use of sleep commands could work, but even then I don't think we could introduce that to the Globus transfer times.
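
One way to hit those timing-dependent branches deterministically, without real transfers or sleeps, might be to mock the status checks. A minimal sketch, assuming zstash.globus exposes the module-level transfer_client seen in the code above (exact names are whatever the PR landed with):

from unittest import mock

import zstash.globus

# Force the "previous task still ACTIVE" branch: get_task() reports ACTIVE,
# so globus_transfer() should return early instead of waiting or submitting.
fake_task = {"status": "ACTIVE", "task_id": "fake-task-id"}
with mock.patch.object(zstash.globus, "transfer_client") as mock_client:
    mock_client.get_task.return_value = fake_task
    # ...drive add_files()/globus_transfer() here and assert on the result...

Swapping "ACTIVE" for "SUCCEEDED" would exercise the other case without any dependence on actual Globus transfer times.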

Other notes

I think it's good to include the extra logging. We can just set the in-depth logging lines to be debug-level.

I've also included a few in-line comments as part of this review.

zstash/create.py Outdated
globus_activate(hpss)
else:
# config.hpss is not "none", so we need to
# create target HPSS directory
logger.debug("Creating target HPSS directory")
logger.debug(f"{ts_utc()}:Creating target HPSS directory {hpss}")
Collaborator:

Add space after : for consistency.

Collaborator (Author):

@forsyth2 I expect to make many edits such as this before full acceptance. My method is to make the changes in my local (chrysalis) repo and commit/push them, rather than make the changes in the web/ui as that confuses me. For instance, I added "DIVING" and "SURFACE" as log-terms just before and after a major function call, so I could be sure when and where processing was occurring, and I expect to remove those terms.

Collaborator:

My method is to make the changes in my local (chrysalis) repo and commit/push them, rather than make the changes in the web/ui

Agreed, I like that method better too.

zstash/globus.py Outdated
)
)
else:
logger.error("Transfer FAILED")
logger.error(f"{ts_utc()}: Transfer FAILED (task_id = {task_id})")
Collaborator:

I would consider updating this error logging to include what the actual task["status"] was, because it's not necessarily "FAILED"

Collaborator (Author):

Agreed! Will modify it to issue the status.
NOTE: An issue with this entire code-block is that we may or may not return, may or may not retain the previous task_id, etc. Logging needs to clarify which is which.

zstash/globus.py Outdated
wait_timeout = 5
max_retries = 3
retry_count = 0
while retry_count < max_retries:
Collaborator:

I don't really understand this part. So, we just give up after max_retries * wait_timeout = 3*5 = 15 seconds? And pretend like we wanted to be non-blocking? Don't Globus transfers typically take hours?

Collaborator (Author):

These values were made artificially small for test purposes (rather than incur massive test delays). You don't see it here, but in create.py I had changed MAXSIZE from n GB to n×100,000 bytes, to force multiple tarfile generations quickly.

If you don't use a "while ... retry < max_retries" loop, and the transfer hangs, then zstash hangs. You want to avoid that.

I'll provide more reasonable values here.

Collaborator:

Oh I see, that makes sense, thanks.

@TonyB9000 (Collaborator, Author) commented Dec 20, 2024

@forsyth2 @golaz I addressed Ryan's comments (I think), but also decided to make a function for the new "globus_block_wait()", to help clarify the code. Retested successfully in both blocking and non-blocking modes. Review should go quickly.

Feel free to suggest additional edits, etc.
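
For context, a blocking wait wrapped as a function might look roughly like the sketch below. This is an illustration only, not the code as merged; the parameter defaults follow the values discussed above:

def globus_block_wait(
    transfer_client, task_id, wait_timeout=7200, polling_interval=10, max_retries=3
):
    retry_count = 0
    task_status = "UNKNOWN"
    while retry_count < max_retries:
        # task_wait() returns once the task completes or wait_timeout elapses.
        transfer_client.task_wait(
            task_id, timeout=wait_timeout, polling_interval=polling_interval
        )
        task_status = transfer_client.get_task(task_id)["status"]
        if task_status == "SUCCEEDED":
            break
        retry_count += 1
    # If max_retries is exhausted, we fall through with a non-terminal status,
    # effectively dropping into non-blocking behavior (the open abort-Y/N question).
    return task_status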

@forsyth2 (Collaborator) left a comment

@TonyB9000 I added a new commit (633a543) that addresses the pre-commit failures. I also ran the unit tests but am still getting some failures that I need to debug.

To try these things yourself, run:

# Make sure everything for Globus is working
cd <some directory>
mkdir zstash_demo; echo 'file0 stuff' > zstash_demo/file0.txt
rm ~/.globus-native-apps.cfg
# Load Unified, so we can use a known-working version of zstash
source /global/common/software/e3sm/anaconda_envs/load_latest_e3sm_unified_pm-cpu.sh
zstash create --hpss=globus://6c54cade-bde5-45c1-bdea-f4bd71dba2cc/~/manual_run zstash_demo
# Paste the two auth codes

# Set up conda dev environment
cd <zstash-clone>
conda clean --all --y
conda env create -f conda/dev.yml -n zstash_pr355
conda activate zstash_pr355

# Run pre-commit checks
pre-commit run --all-files
# Keep running this until it auto-fixes everything or gives you actionable requests.

# Run the unit tests
pip install .
python -u -m unittest tests/test_*.py 
# I'm currently getting `Ran 69 tests`, `failures=34`

We should include the test cases in the PR. If it's too difficult to convert into the unit test framework, at least adding a script to run manually would be useful.

zstash/globus.py Outdated
Comment on lines 316 to 317
# except GlobusHTTPError as e:
# logger.error(f"Exception: {e}")
Collaborator:

This error doesn't seem to exist in any package I can find. Is it a typo?

Collaborator (Author):

It was likely borrowed from the interwebs as I crafted that whole try-block. It was the only "exception" listed, which I thought was weird (what about other exceptions?). I would be happy to eliminate it, and go with the generic "any" exception I added from seeing other examples.
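
For what it's worth, a generic pattern using an exception class that does exist in globus-sdk might look like the following sketch (assumes globus-sdk v3, with transfer_client, transfer_data, and logger in scope):

import globus_sdk

try:
    task = transfer_client.submit_transfer(transfer_data)
except globus_sdk.TransferAPIError as e:
    # Raised for error responses from the Globus Transfer service.
    logger.error(f"TransferAPIError: {e}")
    raise
except Exception as e:
    # Generic catch-all for anything unexpected.
    logger.error(f"Exception: {e}")
    raise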

zstash/hpss.py Outdated
Comment on lines 114 to 124
if (
not keep
and scheme == "globus"
and globus_status == "SUCCEEDED"
and not non_blocking
):
# We should not keep the local file, so delete it now that it is on HPSS
os.remove(file_path)
if not keep and scheme != "globus":
# We should not keep the local file, so delete it now that it is on HPSS
os.remove(file_path)
Collaborator:

Consider changing the logic to the following to make it somewhat easier to follow:

if not keep:
  if scheme != "globus":
    os.remove(file_path)
  elif globus_status == "SUCCEEDED" and not non_blocking:
    os.remove(file_path)

Collaborator:

or alternatively make use of short-circuit evaluation:

if not keep:
  if (scheme != "globus") or (globus_status == "SUCCEEDED" and not non_blocking):
    os.remove(file_path)

@TonyB9000 (Collaborator, Author) commented Jan 2, 2025

Yes, the only reason I kept it as I did was that the second "if" statement was there originally, and I was not sure if it was a good idea to reduce the logic (in case my preceding clause needed to be backed out). I will consolidate.

(ACK! chrysalis tells me "disk quota exceeded" - I cannot save anything. ?!?)

Collaborator (Author):

OK (have space, will travel). I hope I got this correct. My first try to push suggested changes demanded I first pull down your latest commits. I did so - but had to negotiate a merge/conflict (never sure when to use rebase/fast-forward/etc).

Collaborator:

Thanks @TonyB9000

demanded I first pull down your latest commits. I did so - but had to negotiate a merge/conflict (never sure when to use rebase/fast-forward/etc).

I generally use git rebase rather than git pull

git fetch upstream branch-name
git rebase upstream/branch-name
git status
# Merge conflicts will be listed; fix those
git add -A # or just add the changed files individually
git rebase --continue

If you have a lot of commits, this may need to be repeated several times. If the commit history is not very important (it's usually not for commits on a single PR), you can change the history to be just a single commit:

git log
git rebase -i <hash of last commit you don't want to roll into one>
# Keep the `p` (for pick) for the first commit listed
# Replace remaining `p` with `f` (for fixup -- this will roll the commit into the one above it)
# Now, run the code block above

Collaborator (Author):

@forsyth2 Thanks Ryan. I'll add the rebase sequence to my personal "git how-to" page.

@forsyth2 (Collaborator) commented Jan 7, 2025

@TonyB9000 The unit test failures I was running into turned out to have a simple fix. I just pushed the commit; the expected logging statements were just different because of the added timestamps.

The unit tests now pass, but I still think we should have some sort of integration test or at the very least, a README or shell script to test manually. I.e., would it be possible to add a script/directions of how you've been testing this issue? Thanks!

Other than that, this is probably good to merge.

@golaz (Collaborator) left a comment

Approving based on discussion during meetings and cursory review of the code changes. I will test under real world conditions as part of the next RC of e3sm unified.

@TonyB9000 marked this pull request as ready for review January 8, 2025 19:47
@TonyB9000 (Collaborator, Author):

@golaz I am merging to master, and included a "tests3" directory with my test regime and instructions. Please let me know if your field testing uncovers anything unexpected.

@TonyB9000 merged commit 10a5b8f into main Jan 8, 2025
2 of 3 checks passed
@forsyth2 (Collaborator) commented Jan 8, 2025

@TonyB9000 How did you merge this to main? Did you do "Squash and Merge" on the PR, or git merge from the command line?

[screenshot: main's commit history]
The commit history now includes multiple commits from this PR.

We expect more like this:
[screenshot: expected commit history]

@forsyth2 (Collaborator) commented Jan 8, 2025

@tomvothecoder (or maybe @xylar) Re: comment above (#355 (comment)), I clicked the "Revert" button, but that just opens a PR to merge a commit undoing all the changes -- i.e. it doesn't change the commit history.

How can we clean up the commit history? I'm thinking the following?

git fetch upstream
git checkout -b main upstream/main

# Rebase off the latest wanted commit ("Fix pre-commit error" on https://github.com/E3SM-Project/zstash/commits/main/)
git rebase -i c9ea93737b0019c1e679344199dd16d3b6f3adff
# In the rebase, do `d` to drop the merge commit.
# Will that also drop all the other commits it merged in????

# Force push changes
git push -f upstream main

# Re-merge the changes from this PR with "Squash and Merge"

@forsyth2 (Collaborator) commented Jan 8, 2025

i.e. it doesn't change the commit history.

According to https://docs.github.com/en/desktop/managing-commits/reverting-a-commit-in-github-desktop, this is true: "When you revert to a previous commit, the revert is also a commit. The original commit also remains in the repository's history."

That's not what we want; we want a commit history without all the minor commits, only the commit from the merge of this PR.

@forsyth2 (Collaborator) commented Jan 8, 2025

I'm not really sure how this happened, considering we have this set:
[screenshot: repository settings]

We also have "Squash and Merge" set as the only possible way to merge a PR:
[screenshot: "Squash and Merge" as the only enabled merge option]

@TonyB9000 (Collaborator, Author):

@forsyth2 I applied the "merge" on the web-page, not from the command line.

First, from chrysalis:

git fetch origin
git rebase origin/non-block-testing

to incorporate your latest changes. Then

git push origin non-block-testing

to push my test-regime files (induces no conflicts). (My screens look different than yours; I'm on Firefox with dark mode.)

The instructions you provided did not work, BTW. It did not like anything with "upstream".

@xylar (Contributor) commented Jan 10, 2025

in that case I'd say the work should be multiple pull requests as well, though.

Here, I mostly disagree. Typically, a feature isn't fully testable until it is complete and a partially tested, incomplete feature should not be merged.

@forsyth2 (Collaborator):

bite-sized change that is understandable on its own

I see where you're coming from. I think my main issue here is that pull requests, at least for me, are almost never developed in a logical order. It's "got piece X working, got piece Y working, oh here's an improvement to X, here's piece Z, and another improvement for piece Y" -- it's not "Ok, X works. Ok now Y works. Ok now Z works." And that's why, for me, just making one commit that's "X,Y,Z all work now" often makes more sense.

For the zppy/zppy-interfaces refactor, for example, the commits on https://github.com/E3SM-Project/zppy/pull/642/commits and https://github.com/E3SM-Project/zppy-interfaces/pull/1/commits were squashed into single commits on their respective main branches, which I think makes sense because these commit histories show more the haphazard nature of development than they do discrete, logical units of work.

In any case, these sorts of design decisions seem like something that should be discussed more and settled; I can bring these points up with @tomvothecoder @chengzhuzhang and @golaz in our next meeting.

@xylar (Contributor) commented Jan 10, 2025

I think that argues for small commits with a distinct purpose that touch only one or a few files, together with frequent fix-ups that get squashed into the appropriate original commits using rebasing.

Squashing a bunch of commits into a single one because they are messy just creates a single messy commit in my view.
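
For reference, the fix-up-then-squash workflow described above is directly supported by standard git:

# Mark a small fix as belonging to an earlier commit:
git commit --fixup <hash-of-original-commit>
# Later, have git reorder and squash the fixups automatically:
git rebase -i --autosquash <base-commit>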

@forsyth2 (Collaborator):

That seems like it could be a good plan! I can try that going forward

@TonyB9000 (Collaborator, Author):

@forsyth2 I just pushed a small update to the tests3 README file (fixed several typos). Should I ask for a review? Do a PR? (it is on the same branch that was recently merged.)

@tomvothecoder (Collaborator) commented Jan 11, 2025

Both squash and commit and merge and commit have their own advantages and disadvantages. I think the most important thing is how it is done by the contributors and maintainers. Squash and commit for massive PRs might not be ideal, while merge and commit with a ton of messy commits is also not ideal (it requires standardized practice).

I still like using squash and commit with xCDAT because we make smaller, well-defined features and bug fixes at this point in its lifecycle. One thing to note is that PRs will still have all of the commits even after the squash and commit (unless you delete the related branch). However, if you manually squash via git and force push before merging, you will lose all of the history (I don't do this anymore).


In any case, here's ChatGPT's response. Squash and commit vs. merge and commit via ChatGPT:

The choice between squash and commit and merge and commit often depends on the development workflow and the goals of your version control strategy. Here's a breakdown of the two:


Squash and Commit

  • What It Does: Combines all the commits from a feature branch into a single commit before adding it to the target branch (e.g., main or develop).
  • Use Case: Ideal for creating a clean, concise history in the target branch.
  • Advantages:
    • Results in a simpler commit history.
    • Makes it easier to review and understand changes as one cohesive unit.
    • Useful for small, focused feature branches or bug fixes.
  • Disadvantages:
    • Loses the granular history of the branch (e.g., step-by-step progress and debugging information).
    • Harder to debug if a problem arises later, as you can’t easily track individual steps.

When to Use:

  • When the branch contains many small, incremental, or experimental commits that don’t add significant value to the main history.
  • For repositories where clean commit history is prioritized.

Merge and Commit

  • What It Does: Combines the feature branch into the target branch with all its commits preserved. Often creates a merge commit to record the merge itself.
  • Use Case: Suitable for retaining the detailed history of changes made in the feature branch.
  • Advantages:
    • Preserves the complete commit history, including all intermediate commits.
    • Useful for complex changes where each commit is meaningful (e.g., multiple logical steps in a large feature).
    • Easier to debug because you can see the progression of changes over time.
  • Disadvantages:
    • Can result in a cluttered or noisy commit history.
    • Reviewers might need to sift through many small commits.

When to Use:

  • For long-running feature branches where each commit represents a meaningful step.
  • In repositories where preserving detailed commit history is important for auditing or debugging.

Choosing the Right Approach

  • Team's Workflow: Align with your team's Git workflow (e.g., GitHub Flow, GitLab Flow, or trunk-based development).
  • Project Size and Complexity: For large projects with multiple contributors, preserving detailed commit history might be crucial.
  • Repository Purpose:
    • For libraries or open-source projects, clean history (via squash) may be preferred.
    • For internal tools or experimentation, detailed history (via merge) might be more valuable.

Common Scenarios

  • Small feature or bug fix: Squash and commit to keep the main branch tidy.
  • Large feature or refactor: Merge and commit to preserve the full history.
  • Code Review Platforms: Some platforms allow you to choose during a pull/merge request, giving flexibility based on the context.

Would you like help with setting up a workflow or exploring more about either approach?

@tomvothecoder (Collaborator) commented Jan 11, 2025

Also, I used merge and commit for a large refactor effort before, and we would manually go through the git history to squash things into logical groups before merging. I think if contributors have enough discipline on a small team, this can work nicely. However, it's probably a ton more work for larger open-source projects where people's commit habits are all different (e.g., Xarray uses squash and commit).

@xylar (Contributor) commented Jan 11, 2025

@tomvothecoder, thanks for sharing. Interesting thoughts.

However, it's probably a ton more work for larger open-source projects where people's commit habits are all different (e.g., Xarray uses squash and commit).

I feel like exactly such an environment is potentially where squashing commits causes a debugging nightmare if a subtle bug is introduced. E3SM itself is a place where squash commits would be a very bad idea, and that is in part because of the diverse habits of the developers.

@forsyth2 (Collaborator) commented Jan 13, 2025

@tomvothecoder Thanks, this is really helpful.

I'm also copying the notes I put for today's team meeting:

Up until now, my general workflow has been

  • Push commits to the feature branch as I get little pieces working.
  • Merge using the Squash & Merge option on GitHub. (I had this set as the default, at least for me, but Tony’s default was the merge commit option. I then changed the repo settings to require Squash & Merge)
  • The end result is that main had very few commits – each commit having a 1:1 correspondence with a PR. Meanwhile, the commit history on the PR was more thorough. That said, I often squashed commits even on the feature branch, for conciseness.

On this thread, came up with this plan:

  • On a feature branch, make lots of tiny, incremental commits.
  • Before merging the PR, use git rebase -i to squash/fixup commits so that they form logical units of work (e.g., a single commit for “add plots” even if that work was done in several non-contiguous commits)
    • Concern 1: If we have commits A, B, C, and we’re trying to squash C into A, I could then see conflicts occurring when B is applied.
  • Merge the PR with a merge commit (i.e., not squash & merge). This will keep the commits representing logical units of work in the commit history.
    • Concern 2: That’s reasonable if there’s not concurrent work; but if we merge two PRs with lots of commits at around the same time, the commit history is going to be very intertwined. That is, the commit history is no longer linear.

Other notes on commit histories:

  • Does main’s commit history represent a concise story of its development or an engineering log of everything that ever happened?
  • Most importantly, should we force push to zstash’s main to keep a linear commit history or leave the merge commit and keep using merge commits going forward?
  • Benefit of a linear history: easier to see the big picture of what’s been changed (and by extension, easier to write release notes)
  • Benefit of merge commits: more fine-grained changes, easier to understand a smaller chunk of code at a time

I just pushed a small update to the tests3 README file (fixed several typos). Should I ask for a review? Do a PR? (it is on the same branch that was recently merged.)

@TonyB9000 Definitely do a PR. Wait to merge until we figure out how to proceed with the commit history. I'm planning to discuss with @chengzhuzhang and @golaz, but I'm still inclined to squash the commits on main, especially to be consistent with previous commits for this release.

@xylar (Contributor) commented Jan 13, 2025

Concern 1: If we have commits A, B, C, and we’re trying to squash C into A, I could then see conflicts occurring when B is applied.

This is why it is better to do git rebase -i often and not to try to intertwine unrelated development on the same file if you can avoid it. If things have got tangled up, it is time to allow yourself to have 2 commits with similar purpose, e.g. add plots and update plots.

Concern 2: That’s reasonable if there’s not concurrent work; but if we merge two PRs with lots of commits at around the same time, the commit history is going to be very intertwined. That is, the commit history is no longer linear.

This is not really a bad thing. It is one of the huge improvements in git over its predecessors like svn. If you have trouble looking at commit history because it's not linear, I recommend adding this alias to your .gitconfig:

[alias]
        logg = log --graph --oneline --decorate

With git logg, you will see in some pretty clear ASCII art how the commits work together.

The goal should be a clean development workflow, not a clean git history. I still think the squash merge as a policy is too blunt an instrument and sacrifices a clean workflow.

@xylar (Contributor) commented Jan 13, 2025

Here's an example of what that can look like for overlapping feature development:
[screenshot: git commit graph with color-coded branch lines]
You can follow the color-coded lines to see where a feature branch started from, where its commits were added, and where it was merged back.

@forsyth2 (Collaborator):

If things have got tangled up, it is time to allow yourself to have 2 commits with similar purpose, e.g. add plots and update plots.

That makes sense to me if we're doing finer-grained commits.

Ok, git logg does look nicer than how https://github.com/E3SM-Project/zstash/commits/main shows it. It would be nice if the GitHub web interface did something similar.

*   10a5b8f (HEAD -> main, upstream/main, upstream/HEAD, main_20250110) Merge pull request #355 from E3SM-Project/non-block-testing
|\  
| * 4a2a8df added tests3 scripts and README for blocking tests
| * c9ea937 (upstream/non-block-testing, non-block-testing) Fix pre-commit error
| * 7ed1f92 Unit tests passing
| *   5aef52d recommit after pull/merge of Ryan's changes
| |\  
| | * 633a543 (non-block-testing-before-20250106) Fix pre-commit checks and run tests
| * | 5738ff2 minor cleanup
| |/  
| * d7bae50 datetime and timezone unneeded here (ts_utc in util), and corrected comment in hpss
| * 0540e93 Made globus_block_wait a function, clarified log messages, revert parameters
| * 35b98d4 Passed non_blocking variable down call-chain, implemented loop for task_wait on blocking, force globus_transfer to return a sting status, added more logging.
| * f48a372 Passed non_blocking variable down call-chain, added more logging.
| * 969a7c1 Passed non_blocking variable down call-chain, implemented cache-delete-file when blocking success, added more logging.
| * 768e3a9 Added more log messaging
| * a434a36 added ts_utc() function for convenient UTC timestamps
|/  
* 082afc7 Update pre-commit dependencies (#353)
* ef8372a Tweak behaviors of build and release CI workflows (#352)
* 18c84f1 Update supported python versions (#351)
* 74fcc84 Add symlink tests and refactor for new tests (#348)
* 98e6ae1 Add PR template (#347)
* 6894798 (test-post-343) Fix symlink behavior (#343)
* ef87eaf Add DOI badge (#342)

@xylar Re: plan going forward -- so your strong preference would be 1) leave the merge commit from this PR as-is, 2) use merge commits exclusively going forward. Is that correct?

@forsyth2 (Collaborator):

Oh it occurs to me my other big problem with having lots of commits is if main changes significantly. It's way easier to squash the commits on a feature branch into one, and then address merge conflicts all at once, rather than iteratively through n commits.
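
One way to do that collapse-then-rebase with standard git (a sketch, assuming the feature branch forked from main):

git checkout feature-branch
# Move the branch pointer back to the fork point, keeping all changes staged:
git reset --soft $(git merge-base main HEAD)
git commit -m "Consolidated feature commit"
# Conflicts now surface once, against a single commit:
git rebase main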

@xylar (Contributor) commented Jan 13, 2025

@xylar Re: plan going forward -- so your strong preference would be 1) leave the merge commit from this PR as-is, 2) use merge commits exclusively going forward. Is that correct?

My preference would just be not to have a policy of using exclusively squash merge. I wanted to argue for why it should be an option to use merge commits. You can still use squash-merge when not too much info is lost in the process.

Oh it occurs to me my other big problem with having lots of commits is if main changes significantly. It's way easier to squash the commits on a feature branch into one, and then address merge conflicts all at once, rather than iteratively through n commits.

Ah, you've got me there. I do try to rebase first to squash as many commits as I reasonably can but then the process of rebasing onto main can be painful. Sometimes that's when I throw up my hands and start a new branch where I copy over my changes manually.

@xylar (Contributor) commented Jan 13, 2025

It would be nice if the GitHub web interface did something similar.

Apparently it can be done:
https://chromewebstore.google.com/detail/le-git-graph-commits-grap/joggkdfebigddmaagckekihhfncdobff

@forsyth2 (Collaborator):

not to have a policy of using exclusively squash merge

Ok so it seems like the general plan should be squash & merge for small changes, merge commits for large changes, but avoid strict policies. That seems reasonable to me. I can update the git branch protections accordingly.

I do try to rebase first to squash as many commits as I reasonably can but then the process of rebasing onto main can be painful.

Yeah I guess in this case it might make sense to use git rebase -i to at least shrink the number of commits to units of work rather than minimal changes like "add import here" or "fix spacing", so there are fewer, if not one, commits to address conflicts on.

Sometimes that's when I throw up my hands and start a new branch where I copy over my changes manually.

Oh wow! That sounds pretty tedious though.

@xylar (Contributor) commented Jan 13, 2025

The chrome extension works but I still like the command-line version better:
[screenshot: command-line git log graph]

@rljacob (Member) commented Jan 13, 2025

The main model uses merge commits. "git log --first-parent" gives you a linear commit history by showing you only the merge commits.

There are also standalone applications like GitUp that let you browse the branches graphically.

@xylar (Contributor) commented Jan 13, 2025

like Github

I guess that's Gitup, not Github?

@xylar (Contributor) commented Jan 13, 2025

One more situation where squash-merge is undesirable is when multiple people are involved in development. Some (or maybe even all!) lose credit in that process.

@forsyth2 (Collaborator):

One more situation where squash-merge is undesirable is when multiple people are involved in development. Some (or maybe even all!) lose credit in that process.

I've noticed on Squash & Merge, it will list everyone who authored any of the commits as an author, but you're right that which person did which work gets lost.

@xylar (Contributor) commented Jan 13, 2025

Ah, still, better than losing authors entirely.

@forsyth2 (Collaborator) commented Jan 13, 2025

@xylar @tomvothecoder Conclusions from meeting with @TonyB9000, @chengzhuzhang, @golaz:

  • Conclusion for this PR is to simply leave Tony’s merge commit as-is.
  • Going forward, commit type should be case-by-case, not standard policy. Individual developers can use their best judgment.
    • So I'll re-enable the other merge types on the repos then.
  • General guideline: PR is for adding a single feature, but a PR might be multiple commits if the feature has a number of distinct parts
    • Squashing commits on a feature branch is fine to consolidate coherent chunks of work.

Thanks for this discussion; it was very informative.

@TonyB9000 (Collaborator, Author):

@forsyth2 I pushed my update to the tests3/README documentation last week - but there is no "Do a PR" option where I usually find one (here). Perhaps this is because I pushed to a branch already merged? How to proceed?

@forsyth2 (Collaborator):

Perhaps this is because I pushed to a branch already merged?

That would be my guess

How to proceed?

From the command line:

# Make sure you're on the branch
git checkout non-block-testing

# Change the branch's name
git branch -m non-block-testing-fix

# Push the new branch name
git push origin non-block-testing-fix

@forsyth2 (Collaborator):

So I'll re-enable the other merge types on the repos then.

Actually I still think it might be good to disallow rebase & merge -- the other two at least have a very clear "this PR introduced this change" connection. The rebase & merge option adds commits individually which I think could get very messy.

[screenshot: pull request merge options]

@forsyth2 mentioned this pull request Jan 13, 2025
@xylar (Contributor) commented Jan 14, 2025

That's right, I don't think the rebase merge is a good idea.

Successfully merging this pull request may close these issues: zstash is always non-blocking