Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor xopen #143

Closed
wants to merge 40 commits into from
Closed

Refactor xopen #143

wants to merge 40 commits into from

Conversation

rhpvorderman
Copy link
Collaborator

@rhpvorderman rhpvorderman commented Jan 22, 2024

fixes #141

So I started with making PipedCompressionReader and -Writer binary. This was a refactor that saved quite a few lines, and after that I just kept going.

Changes:

  • Make PipedcompressionReader and -Writer binary only. All the _open_format functions only return binary streams. The xopen function wraps these streams in an io.TextIOWrapper if the requested stream is a text stream. The advantage of this model is that text encoding arguments do not have to be passed through the application. Change compared to main).
  • PipedCompressionReader and -Writer share a lot of code. Create a PipedCompressionProgram class that can do everything. This allows common paths for error handling etc. Saves quite a lot of lines. Change compared to the previous change
  • All these extra classes for each type of program are repetitive. Create a settings object, make that a required input for PipedCompressionProgram and simply create a dictionary that binds every known program to a settings object. Change compared to the previous change

As you can see the changing is iterative so the definitive edition is in this branch. I made the other branches to make it easier to look at each change in isolation. For your reviewing comfort.

Before: __init__.py 1296 lines
After: __init__.py: 734 lines.

The xopen API is unchanged. The test_xopen file contains no changes. Internally I made quite a lot of backwards-incompatible changes. This is only problematic if people were using our internals before. (I only ever used xopen, so I don't think this is very likeley). Nevertheless a signaling with a version 2.0.0 wouldn't hurt. I made _PipedCompressionProgram start with an _ so it is clearly signalled that this is an internal and not a public API.

EDIT: This is purely solving technical debt. So it is not necessary to review this quickly. Mabe we should make a new release first with all the zstandard changes?

@rhpvorderman rhpvorderman requested a review from marcelm January 22, 2024 15:12
@marcelm
Copy link
Collaborator

marcelm commented Jan 29, 2024

This is quite a big PR and I’m not so comfortable reviewing it as-is. I’d rather review many PRs with few commits rather than 1 PR with many commits. Do you think you can split this up a bit? You already have the three branches.

Regarding Git history, I noticed that there are a lot of "Fix [something]" commits and many commits that do not pass CI.

I call these "half commits" because they are not self contained. Ideally, no commit should break CI. Half commits lead to a noisy Git history and make git bisect harder to use. Reviewing PRs becomes harder. The commit message "Fix [...]" is a bit misleading if seen in isolation because the commit doesn’t actually fix a bug in the software, it just fixes an earlier commit within the same PR. A pure refactoring PR shouldn’t fix bugs.

I don’t want to ask you to perfectly clean up the history here because it is more work to do this afterwards than to do it while working. That said, here are a couple of strategies that I use for manipulating Git history so that it looks nice. You probably do some of these, but I thought I’d just write down what came to my mind.

Let’s say I work on refactor B and realize I need to do refactor A beforehand. I either start a new branch or use git stash to stow away my current, half-finished changes. Only after A passes CI and is committed will I go back to B. The final Git history will only contain commit A, then B.

I very often use git add -p to ensure I only add things strictly related to the commit.

I try to run all tests after each commit. (Since this is a bit boring, I sometimes don’t, but I also regret it sometimes.) Local tests should be run on the clean repository, so I use git stash beforehand.

When I realize that the most recent commit does not pass CI, I fix the problem and use git commit --amend to squash the two together.

If I notice that commit earlier than the immediate preceding is broken, I make a "fix" commit, then use rebase to squash the broken and "fix" commit.

I make lots of temporary commits that I mark with "fix ..." or "wip ..." so that I will remember to clean them up later.

I use interactive rebase (git rebase -i) a lot to reorder commits or to clean up the "fix" and "wip" commits. I start normal commit messages with an uppercase letter, so the lowercase "fix" and "wip" stand out, which helps me remember that I need to squash them with some other commit before pushing.

@rhpvorderman
Copy link
Collaborator Author

Thanks for the feedback. I see where you are coming from. I will do this in separate PRs with a more clean git history. Since the changes are individually not so big (less than 200 lines) I can each put them in a separate commit.

I really should be more mindful of my git behavior. Currently I spam quite a lot of commits. Something to work on for 2024!

@rhpvorderman
Copy link
Collaborator Author

Created #145 . I will close this as this PR will not be pulled as is.

@rhpvorderman rhpvorderman deleted the refactor branch March 25, 2024 09:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PipedCompressionReader and PipedCompressionWriter should be binary IO objects only.
2 participants