Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add script to move files to the framework #145

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

davidhorstmann-arm
Copy link
Contributor

Pushing a small script that moves files to the framework repo from Mbed TLS, while retaining history.

This script works but may not be in a fully-polished state.

This is useful for building file-moves on top of other not-yet-merged
file moves.

Signed-off-by: David Horstmann <[email protected]>
When there are no dirs to make we try anyway.

This is bad and causes an error. Fix this.

Signed-off-by: David Horstmann <[email protected]>
To allow branches to be built on other branches, resolve conflicts
'manually' when the 2 branches delete each others' files.

Signed-off-by: David Horstmann <[email protected]>
Allow passing the --except flag, which stops a file from being moved
(and perform the necessary path manipulation acrobatics to make that
work).

Created to be able to move tests/include except for
tests/include/alt-dummy, like so:

--path tests/include:include --except tests/include/alt-dummy

Signed-off-by: David Horstmann <[email protected]>
This was failing if no exceptions were supplied.

Signed-off-by: David Horstmann <[email protected]>
This causes the script to not perform the operation, but simply print
the self._file_map dictionary of files to rename (this is useful for
debugging the file-moving logic, especially with combinations of
--path and --except options).

Signed-off-by: David Horstmann <[email protected]>
When files are moved to the exact same location in the framework repo
as they were in Mbed TLS, there is a problem.

Since no git change has taken place on the Mbed TLS side, the most
recent change to the file is its deletion when a previous set of files
were moved to the framework (since all other files are deleted in this
kind of move).

As a result the moved files are deleted and do not appear in the final
branch. Deal with this edge case explicitly by taking files which are
not renamed from the file map and checking them out to the state in
the temporary branch, before amending the resulting commit.

Signed-off-by: David Horstmann <[email protected]>
Copy link
Contributor

@gilles-peskine-arm gilles-peskine-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done a mostly full code review, more than we usually do in mbedtls-docs. Arguably this script could live in the framework repository; for that level of quality, I'd want most of my comments to be handled. For the level of quality in mbedtls-docs, all my comments are optional except one: this script seems to mess up if the destination repo is the framework subdirectory of the source.

def add_file_move(self, src_path: str, dst_path: str) -> None:
"""Move file at relative path src_path in the source repo to dst_path
in the destination repo"""
self._file_map[src_path] = dst_path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why no normalization like there is for the initialization of _file_map?

cmd = [self.GIT_EXE] + git_cmd
return subprocess.check_output(cmd, **kwargs)

def add_file_move(self, src_path: str, dst_path: str) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method isn't used anywhere. __init__ should probably call it instead of setting _file_map on its own.

def __init__(self, source_repo: str, dest_repo: str,
source_branch_name: str, dest_branch_name: str,
file_map: Dict[str, str],
dest_base_branch_name: Optional[str] = None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to document the parameters. It's not obvious what you can put in file_map, or the difference between dest_branch_name and dest_base_branch_name.

def _direct_children(self, dir_path: str) -> List[str]:
"""Get the direct children of a subdirectory (at least the ones
that matter to git)"""
leaf_files = self.run_git(['ls-files', dir_path]).split()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we ever have a file name containing a space, this will split on the space. (Control characters in file names would also not work properly.) The robust way would be

Suggested change
leaf_files = self.run_git(['ls-files', dir_path]).split()
leaf_files = self.run_git(['ls-files', '-z', '--', dir_path]).rstrip('\0').split('\0')

This applies to the other uses of ls-files.

It would be good to make an auxiliary method that calls ls-files and returns a list.

Comment on lines 73 to 80
leaf_files = self.run_git(['ls-files', dir_path]).split()
path_len = len(self._path_parts(dir_path)) + 1
direct_children = set([
os.path.join(
*self._path_parts(f)[:path_len]
)
for f in leaf_files
])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be equivalent but simpler: run git -C dir_path ls-files and strip anything after a slash.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though looking at how this method is used, it seems that you're stripping the paths down to base names here, only to reconstruct the full path later in _expand_dir_move?

Copy link
Contributor Author

@davidhorstmann-arm davidhorstmann-arm Nov 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part removes the suffix that git ls-files outputs. We might get:

foo/bar/a.c
foo/bar/b.c
foo/baz

Which we reduce to just the direct descendants of foo, so:

foo/bar
foo/baz

We never reconstruct the suffix in _expand_dir_move(), but we substitute the prefix, e.g. if foo is being moved to new_foo:

new_foo/bar
new_foo/baz

This is a part of the transformation required for the --except option.

I was not aware that passing -C with a directory had this effect though, that will certainly simplify this.

def main() -> int:
"""Command line entry point."""
parser = argparse.ArgumentParser()
parser.add_argument('--path', action='append', nargs='?',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than require --path to be repeated for each path, I think this should be the positional arguments. In particular it would allow calls like mbedtls-move-to-framework … subdir/*.py.

GIT_EXE = 'git'
FRAMEWORK_DEFAULT_BASE_BRANCH = 'main'

def __init__(self, source_repo: str, dest_repo: str,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please be consistent when abbreviating “destination”: either dst or dest, not both.

Comment on lines +295 to +296
parser.add_argument('--dst-base-branch', default=None, required=False,
help='Base branch in the framework repo to make changes from')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't actually need to be a branch name, it can be any revision. In practice I'd want whatever is the current head of the framework submodule, and I think that should be the default, rather than main.

In fact main is likely not to work in practice, because it'll typically be some old reference made when the working directory was originally cloned I delete the default branch name in my local clones to avoid keeping that old reference around, so when I didn't pass --dst-branch-name, I got the error

fatal: 'main' matched multiple (4) remote tracking branches

self.run_git(['checkout', self._framework_base_branch_name])

# Create a new branch with checkout -b
self.run_git(['checkout', '-b', self._framework_branch_name])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add checks for common errors before starting to do anything. At least check that the branches don't already exist, which is likely to be the case the second time the same person uses the script.

# Delete all files except the ones we are moving
for f in repo_files:
if os.path.normpath(f) not in files_to_retain:
self.run_git(['rm', f])
Copy link
Contributor

@gilles-peskine-arm gilles-peskine-arm Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BLOCKER: As far as I understand it (more info in Slack thread), this will in particular remove framework. So from that point, when checking out the original head again, the framework submodule is no longer initialized. Then later the framework directory itself disappears, which I don't understand.

I don't fully understand what's going on, but as far as I can tell, you can't have the destination be the framework directory of the source, it has to be a separate worktree. This is a very unintuitive limitation that needs to be at least documented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now documented and I have added a check so that an error message is printed if you try to move files to a framework submodule within the Mbed TLS checkout that you are moving from.

@gilles-peskine-arm
Copy link
Contributor

I'm marking this as priority-high because we need this to move files to the framework with the history preserved. Arguably this should be in the framework repository as an essential maintainer tool (even if it'll only be essential for a few months).

self.run_git(['mv', f, self._file_map[f]])

# Commit the result
commit_message = "Move some files to framework repository"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like the ability to specify a commit message. I can do it afterwards by rebasing, and that's not really a problem in the source branch, but in the destination branch, the commit message is behind a merge and git is reluctant to edit history beyond a merge.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, if I want to edit the commit message, I need to redo the conflict resolution which is implemented here as resolve_deletion_conflicts. That's not easy!

@gilles-peskine-arm
Copy link
Contributor

I've now used the script successfully. As far as I can't tell it just can't work if the destination is the framework subdirectory of an mbedtls worktree. If the source is an mbedtls worktree, the script can work, but the source worktree is disrupted because the framework submodule gets deinitialized, and the resulting history may be wrong because any file that is neither tracked nor ignored gets added.

So this script should come at least with a warning, if not enforcement, that both the source and destination should be independent worktrees, freshly checked out, that are not used for doing anything else. Preferably, the script should create these worktrees itself (as mbedtls-rewrite-branch-style does, we can reuse some of its code).

@mpg
Copy link
Contributor

mpg commented Oct 10, 2024

I've not yet used this script myself, but looking at its result there seems to be extra commits that I wouldn't expect to be present in the history, see Mbed-TLS/mbedtls-framework#55 (comment)

@mpg
Copy link
Contributor

mpg commented Oct 10, 2024

I'm marking this as priority-high because we need this to move files to the framework with the history preserved. Arguably this should be in the framework repository as an essential maintainer tool (even if it'll only be essential for a few months).

Agreed. I've created Mbed-TLS/mbedtls-framework#57 and added it to the first framework epic. (I don't care much about where the script lives, as long as it meets quality requirements.)

@gilles-peskine-arm
Copy link
Contributor

I've not yet used this script myself, but looking at its result there seems to be extra commits that I wouldn't expect to be present in the history

We're preserving the commit history, not just the patch history. So each import branch must have the full history of the project up to the last commit that modifies one of the imported files.

The first time we did this in a pull request, GitHub showed basically all the mbedtls history as part of the pull request, because these were all new commits in the main branch of the framework. In each subsequent pull request that imports files into the framework, GitHub just shows recent commits that aren't yet in one of the imports.

There are tools that can rewrite a git branch to only retain commits that modify specific files, and construct a commit history with the same messages and the same patches for this subset of files. But you lose commit IDs (obviously), renamed files, content moved between files, etc. And these tools are harder to use. So I think we've made the right choice.

@mpg
Copy link
Contributor

mpg commented Oct 10, 2024

Ah OK, I had misunderstood what the script does.

There are tools that can rewrite a git branch to only retain commits that modify specific files, and construct a commit history with the same messages and the same patches for this subset of files. But you lose commit IDs (obviously), renamed files, content moved between files, etc. And these tools are harder to use. So I think we've made the right choice.

Yeah, that's what I thought the script was doing, and indeed now that you mention it that wouldn't be the right thing to do. Thanks for clearing that up.

This allows the user to write a message that they prefer. While we're at
it, we can supply a helpful template so that it's a bit clearer what
each commit message is for.

Signed-off-by: David Horstmann <[email protected]>
Instead of passing a dictionary file map when creating a RepoFileMover
object, remove this from the constructor.

The correct way to add a file move is via the add_file_move() method (as
this properly encapsulates the file map).

Signed-off-by: David Horstmann <[email protected]>
Change _direct_children() to return only the child directory names (sans
prefix), which can be done easily with:

git -C dir ls-files

Also reconstruct paths more explicitly in _expand_dir_move().

Signed-off-by: David Horstmann <[email protected]>
Replace some statements with equivalent, simpler alternatives.

Signed-off-by: David Horstmann <[email protected]>
Update the description to be more accurate. In particular, document the
limitation that a separate checkout is required.

Signed-off-by: David Horstmann <[email protected]>
If the user tries to move files to the internal framework checkout,
print an error and exit.

Signed-off-by: David Horstmann <[email protected]>
Set nargs=1 for each of these arguments. Since this will return a
single-element list containing the argument (not the argument itself),
change the action to 'extend', which will concatenate lists rather than
appending individual values to a list.

Signed-off-by: David Horstmann <[email protected]>
@davidhorstmann-arm
Copy link
Contributor Author

@mpg I believe I have addressed the main blocker and a few other improvements, so this should now be in a reasonable state to be handed over to someone else for followup.

@mpg
Copy link
Contributor

mpg commented Dec 9, 2024

@gilles-peskine-arm Unless you disagree, I'd like to merge this script mostly as it stands now: I think your main comment has been addressed, and in practice we (at least Elena, Valerio and I) have been using this script a few times now without trouble.

Of course that doesn't mean we won't improve it later on if usability or robustness issues start being a problem. But this can be done as a different PR (probably by a different author now that David have moved to other priorities).

Could you give this PR a second review with that in mind?

@gilles-peskine-arm gilles-peskine-arm dismissed their stale review December 9, 2024 12:56

The script now refuses to run if the destination is a submodule of the source, rather than failing mysteriously, so my blocker comment is resolved.

@gilles-peskine-arm
Copy link
Contributor

I'm sorry, I can't promise a review right now. On the other hand, I can see that my one blocker comment has been addressed: the script won't fail in a mysterious way if you try to use it in what is to me the most natural scenario, namely moving files to a submodule of the current checkout. So I've dismissed my review. Since you seem to be happy with it, I'll go ahead and merge.

@gilles-peskine-arm
Copy link
Contributor

I'll go ahead and merge.

Well, not quite: I don't feel comfortable merging it without doing at least a cursory review. I'll try to do that later, but if someone else feels comfortable approving this PR, please go ahead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority-high size-m Estimated task size: medium (~1w)
Projects
Development

Successfully merging this pull request may close these issues.

3 participants