
feat: handle hardlinks symmetrically #19

Open

wants to merge 4 commits into base: handle-hardlinks
Conversation


@zhijie-yang zhijie-yang commented Sep 5, 2024

  • Have you signed the CLA?


@rebornplusplus rebornplusplus left a comment

Nice changes, thank you! Left a few comments below.

Comment on lines +41 to +42
// The identifier for the hardlink is unique for each unique base file,
// which counts from 1. The 0 value is reserved for files that are not hard links.

We might need to explain the "base file" since that's something we came up with, or try to think of a better name.

Owner

If I understand it correctly, it is probably "base file" == "inode".

Author

Yes, this means "The identifier for the hard link should be unique for all the file entries in the same hard link group"
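To make the numbering scheme concrete, here is a minimal sketch of the semantics described above: every entry in the same hard link group shares one identifier counting from 1, and 0 is reserved for entries that are not hard links. All names here (`assignGroupIDs`, the pair encoding) are invented for this illustration and are not part of the PR.

```go
package main

import "fmt"

// assignGroupIDs walks tar entry names paired with their hard link target
// ("" for entries that are not hard links) and returns an ID per entry.
// Entries in the same hard link group share one ID, counting from 1;
// entries that are not hard links get the reserved value 0.
func assignGroupIDs(entries [][2]string) map[string]int {
	ids := make(map[string]int)
	groupOf := make(map[string]int) // base file path -> group ID
	next := 1
	for _, e := range entries {
		name, target := e[0], e[1]
		if target == "" {
			ids[name] = 0 // not (yet known to be) part of a group
			continue
		}
		id, ok := groupOf[target]
		if !ok {
			id = next
			next++
			groupOf[target] = id
			ids[target] = id // the base file joins its own group
		}
		ids[name] = id
	}
	return ids
}

func main() {
	ids := assignGroupIDs([][2]string{
		{"./a", ""},     // regular file, later targeted by links
		{"./b", ""},     // regular file with no links
		{"./l1", "./a"}, // hard link to ./a
		{"./l2", "./a"}, // another hard link to ./a
	})
	fmt.Println(ids["./a"], ids["./b"], ids["./l1"], ids["./l2"]) // prints: 1 0 1 1
}
```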

}

type tarMetadata struct {
HardLinkRevMap map[string]hardLinkRevMapEntry

Maybe a comment about this map would be helpful.

// getDataReader returns a ReadCloser for the data payload of the package.
// Calling the Close method must happen outside of this function to prevent
// premature closing of the underlying package reader.
// The xz.Reader is wrapper with io.NopCloser since it does not implement the Close method.

Suggested change
// The xz.Reader is wrapper with io.NopCloser since it does not implement the Close method.
// The xz.Reader is wrapped with io.NopCloser since it does not implement the Close method.

nitpick

if err != nil {
return err
}
// Fist pass over the tarball to read the header of the tarball

Suggested change
// Fist pass over the tarball to read the header of the tarball
// First pass over the tarball to read the header of the tarball

nitpick

@@ -153,6 +197,9 @@ func extractData(dataReader io.Reader, options *ExtractOptions) error {
return err
}

// targetDir is the directory where the file is extracted.
// It is either the options.TargetDir or the stagingDir.

Suggested change
// It is either the options.TargetDir or the stagingDir.
// It is either the options.TargetDir or the stagingDir depending on the file.

// Nothing to do.
continue
// Extract the hard link base file to the staging directory, when
// 1. it is not part of the target paths (len(targetPaths) == 0)

I would probably put it like "it's not listed in the slice definition file".

// 2. it is required by other hard links (exists as a key in the HardLinkRevMap)
// In case that [len(targetPaths) > 0], the hard link base file is extracted normally.
// Note that the hard link base file can also be a symlink.
// tarHeader.Name is used here since the paths in the HardLinkRevMap are relative.

If it's possible and if it makes sense, let's try to use absolute paths everywhere to avoid confusion.

Comment on lines +377 to +381
func newTarMetadata() tarMetadata {
return tarMetadata{
HardLinkRevMap: make(map[string]hardLinkRevMapEntry),
}
}

You probably don't need this function, since it is used only once and is quite minimal.

Comment on lines +229 to +231
if err != nil {
return nil, err
}

Should this be left out? Not sure which error you are checking.

"/hardlink": "hardlink 1 {test-package_myslice}",
},
}, {
summary: "Symlink is a valid hard link base file",

Suggested change
summary: "Symlink is a valid hard link base file",
summary: "Hard links can point to symlinks",

Owner

@letFunny letFunny left a comment

Left a few comments about the general approach, once we discuss those we can discuss the details. Thanks!

Comment on lines +41 to +42
// The identifier for the hardlink is unique for each unique base file,
// which counts from 1. The 0 value is reserved for files that are not hard links.
Owner

If I understand it correctly, it is probably "base file" == "inode".

Comment on lines +154 to +155
tarReader := tar.NewReader(dataReader)
tarMetadata, err := readTarMetadata(tarReader)
Owner

This is always going to read the file twice, even if there are no hardlinks. I think a better design is to do the extraction for all files in the first loop and keep track of the hardlinks that we cannot extract; then, only if necessary, we loop over the tarball again.

Author

This is not about "whether or not we can extract the hardlink", but about "how all the files extracted from a hardlink group can be represented symmetrically in the report". Think about a counter-example: when we meet the "regular file" (which is the target of other hardlinks) during the first pass over the tarball, we don't know whether it is associated with a hardlink group, so in the entry returned by fsutil.Create we can only report it as a regular file. As we proceed through the rest of the tarball and meet the hardlinks associated with this "regular file", we are no longer able to update the hardLinkId field in the entry for the "regular file", because fsutil.Create has already finished processing it.
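The two-pass shape being debated can be sketched with the standard library alone. Everything below (the `collectHardLinks` helper, the in-memory tarball) is fabricated for this example: the first pass only records hard link metadata, then the stream is rewound with Seek before the extracting pass would run.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
)

// collectHardLinks does a metadata-only pass: it maps each hard link entry's
// name to the path it targets, without extracting anything.
func collectHardLinks(r io.Reader) (map[string]string, error) {
	tr := tar.NewReader(r)
	links := make(map[string]string)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		if hdr.Typeflag == tar.TypeLink {
			links[hdr.Name] = hdr.Linkname
		}
	}
	return links, nil
}

// makeTar builds a tiny in-memory tarball: one regular file and one hard link.
func makeTar() *bytes.Reader {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	tw.WriteHeader(&tar.Header{Name: "./a", Typeflag: tar.TypeReg, Size: 2, Mode: 0o644})
	tw.Write([]byte("hi"))
	tw.WriteHeader(&tar.Header{Name: "./l", Typeflag: tar.TypeLink, Linkname: "./a"})
	tw.Close()
	return bytes.NewReader(buf.Bytes())
}

func main() {
	r := makeTar()
	links, err := collectHardLinks(r) // first pass: metadata only
	if err != nil {
		panic(err)
	}
	r.Seek(0, io.SeekStart) // rewind for the second, extracting pass
	fmt.Println(links["./l"])
}
```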

Comment on lines +101 to +104
// getDataReader returns a ReadCloser for the data payload of the package.
// Calling the Close method must happen outside of this function to prevent
// premature closing of the underlying package reader.
// The xz.Reader is wrapper with io.NopCloser since it does not implement the Close method.
Owner

In my opinion all of the comment here is unnecessary because it is a mix of Go idiosyncrasies and implementation details. The comment on the function should not state things like "xz.Reader is wrapper with io.NopCloser...", that is something internal that the user does not care about. And the bit about "Calling the Close method must happen outside of this function..." is also something that is taken for granted in Go.

func extractData(pkgReader io.ReadSeeker, options *ExtractOptions) error {
// stagingDir is the directory where the hard link base file, which is not
// listed in the pendingPaths, to be extracted.
stagingDir, err := os.MkdirTemp("", "chisel-staging-")
Owner

Why do we need a staging dir in the first place? I think it should not be needed at all. Correct me if I am wrong but:

  1. First iteration over the tarball we extract all entries that are not hardlinks (and possibly hardlinks depending on whether the content entry was extracted).
  2. After the first iteration we have a list of hardlinks and the path they should point to according to the tar, that is a map from target -> list of hardlinks.
  3. On the second iteration we find target in the tarball. We extract target's content to hardlinks[0], then iterate over the remaining elements and create hardlinks[1..] as links to hardlinks[0].

As I said in 1., there is also the optimization that target might already be extracted, so there is no need for a second iteration. The downside is that we need to keep track of the names of the extracted files to know whether target was extracted; we already do that in the report, but exposing a good API here might be a challenge, and we have to think about that.

In fact, because this is a "niche" use case and we do not anticipate it happening a lot, and because the optimization mentioned in 1. is even more niche, I would say let's not worry about it unless it is easy to implement.
