feat: handle hardlinks symmetrically #19
base: handle-hardlinks
Conversation
zhijie-yang commented Sep 5, 2024 (edited)
- Have you signed the CLA?
Nice changes, thank you! Left a few comments below.
// The identifier for the hardlink is unique for each unique base file,
// which counts from 1. The 0 value is reserved for files that are not hard links.
We might need to explain the "base file" since that's something we came up with, or try to think of a better name.
If I understand it correctly, it is probably "base file" == "inode".
Yes, this means "The identifier for the hard link should be unique for all the file entries in the same hard link group"
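If it helps, the numbering scheme described here can be sketched as follows (a hypothetical illustration; `assignHardLinkIds` and its inputs are not from the PR):

```go
package main

import "fmt"

// assignHardLinkIds sketches the id scheme discussed above: every hard
// link group shares one id counting from 1, and 0 is reserved for
// entries that are not hard links. linkTargets maps an entry name to
// the name it hard-links to ("" for regular files).
func assignHardLinkIds(entries []string, linkTargets map[string]string) map[string]int {
	ids := make(map[string]int)
	next := 1
	for _, name := range entries {
		target := linkTargets[name]
		if target == "" {
			if _, ok := ids[name]; !ok {
				ids[name] = 0 // not (yet known to be) part of a group
			}
			continue
		}
		// First link seen for this target: allocate a fresh group id
		// shared by the target and all of its links.
		if ids[target] == 0 {
			ids[target] = next
			next++
		}
		ids[name] = ids[target]
	}
	return ids
}

func main() {
	entries := []string{"base", "link1", "link2", "plain"}
	links := map[string]string{"link1": "base", "link2": "base"}
	fmt.Println(assignHardLinkIds(entries, links))
}
```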
}

type tarMetadata struct {
	HardLinkRevMap map[string]hardLinkRevMapEntry
Maybe a comment about this map would be helpful.
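For instance, a doc comment on the map could read along these lines (the entry's fields here are guesses for illustration; only the names `tarMetadata` and `HardLinkRevMap` come from the diff):

```go
package main

import "fmt"

// hardLinkRevMapEntry's fields are assumed for illustration.
type hardLinkRevMapEntry struct {
	ID    int      // hard link group id, counting from 1
	Links []string // paths of the hard links that reference the key
}

// tarMetadata holds state gathered on the first pass over the tarball.
// HardLinkRevMap is a "reverse" map keyed by the path of a hard link
// target, recording which other tar entries hard-link to it, so the
// extractor knows that a file belongs to a hard link group before the
// file itself is written out.
type tarMetadata struct {
	HardLinkRevMap map[string]hardLinkRevMapEntry
}

func main() {
	meta := tarMetadata{HardLinkRevMap: map[string]hardLinkRevMapEntry{
		"./usr/bin/tool": {ID: 1, Links: []string{"./usr/bin/alias"}},
	}}
	fmt.Println(meta.HardLinkRevMap["./usr/bin/tool"].ID)
}
```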
// getDataReader returns a ReadCloser for the data payload of the package.
// Calling the Close method must happen outside of this function to prevent
// premature closing of the underlying package reader.
// The xz.Reader is wrapper with io.NopCloser since it does not implement the Close method.
- // The xz.Reader is wrapper with io.NopCloser since it does not implement the Close method.
+ // The xz.Reader is wrapped with io.NopCloser since it does not implement the Close method.
nitpick
if err != nil {
	return err
}
// Fist pass over the tarball to read the header of the tarball
- // Fist pass over the tarball to read the header of the tarball
+ // First pass over the tarball to read the header of the tarball
nitpick
@@ -153,6 +197,9 @@ func extractData(dataReader io.Reader, options *ExtractOptions) error {
	return err
}

// targetDir is the directory where the file is extracted.
// It is either the options.TargetDir or the stagingDir.
- // It is either the options.TargetDir or the stagingDir.
+ // It is either the options.TargetDir or the stagingDir depending on the file.
// Nothing to do.
continue
// Extract the hard link base file to the staging directory, when
// 1. it is not part of the target paths (len(targetPaths) == 0)
I would probably put it like "it's not listed in the slice definition file".
// 2. it is required by other hard links (exists as a key in the HardLinkRevMap)
// In case that [len(targetPaths) > 0], the hard link base file is extracted normally.
// Note that the hard link base file can also be a symlink.
// tarHeader.Name is used here since the paths in the HardLinkRevMap are relative.
If it's possible and if it makes sense, let's try to use absolute paths everywhere to avoid confusion.
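If the switch to absolute paths is made, the normalization could be as simple as this sketch (`toAbs` is a hypothetical helper, not from the PR):

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// toAbs normalizes a tar entry name such as "./usr/bin/ls" to the
// absolute form "/usr/bin/ls" used elsewhere in the discussion.
// path.Clean (not filepath.Clean) keeps forward slashes on every OS,
// matching tar entry names.
func toAbs(tarName string) string {
	return path.Clean("/" + strings.TrimPrefix(tarName, "./"))
}

func main() {
	fmt.Println(toAbs("./usr/bin/ls"))
	fmt.Println(toAbs("usr/share/doc/"))
}
```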
func newTarMetadata() tarMetadata {
	return tarMetadata{
		HardLinkRevMap: make(map[string]hardLinkRevMapEntry),
	}
}
You probably don't need this function since we are using it once and it's quite minimal.
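Inlining it at the single call site would look something like this sketch (types stubbed out for a self-contained example):

```go
package main

import "fmt"

type hardLinkRevMapEntry struct{}

type tarMetadata struct {
	HardLinkRevMap map[string]hardLinkRevMapEntry
}

func main() {
	// Instead of a one-use newTarMetadata constructor, initialize the
	// struct literal directly where it is needed:
	meta := tarMetadata{
		HardLinkRevMap: make(map[string]hardLinkRevMapEntry),
	}
	fmt.Println(meta.HardLinkRevMap == nil, len(meta.HardLinkRevMap))
}
```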
if err != nil {
	return nil, err
}
Should this be left out? Not sure which error you are checking.
"/hardlink": "hardlink 1 {test-package_myslice}",
},
}, {
summary: "Symlink is a valid hard link base file",
- summary: "Symlink is a valid hard link base file",
+ summary: "Hard links can point to symlinks",
Left a few comments about the general approach, once we discuss those we can discuss the details. Thanks!
tarReader := tar.NewReader(dataReader)
tarMetadata, err := readTarMetadata(tarReader)
This is always going to read the file twice even if there are no hardlinks. I think a better design is to do the extraction for all files in the first loop and keep track of the hardlinks that we cannot extract then, only if necessary, do we loop over the tarball again.
This is not about "whether or not we can extract the hardlink", but about "how all the files extracted from a hardlink group can be represented symmetrically in the report". Think about a counter-example: when we meet the "regular file" (which is the target of other hardlinks) in the first pass reading the tarball, we don't know yet that it is associated with a hardlink group, so in the entry returned by `fsutil.Create` we can only report it as a regular file. As we proceed through the rest of the tarball and meet the hardlinks associated with this "regular file", we are no longer able to update the `hardLinkId` field in the entry for the "regular file", because the `fsutil.Create` call that processed it has already returned.
// getDataReader returns a ReadCloser for the data payload of the package.
// Calling the Close method must happen outside of this function to prevent
// premature closing of the underlying package reader.
// The xz.Reader is wrapper with io.NopCloser since it does not implement the Close method.
In my opinion all of the comment here is unnecessary because it is a mix of Go idiosyncrasies and implementation details. The comment on the function should not state things like "xz.Reader is wrapper with io.NopCloser...", that is something internal that the user does not care about. And the bit about "Calling the Close method must happen outside of this function..." is also something that is taken for granted in Go.
func extractData(pkgReader io.ReadSeeker, options *ExtractOptions) error {
	// stagingDir is the directory where the hard link base file, which is not
	// listed in the pendingPaths, to be extracted.
	stagingDir, err := os.MkdirTemp("", "chisel-staging-")
Why do we need a staging dir in the first place? I think it should not be needed at all. Correct me if I am wrong, but:

1. On the first iteration over the tarball we extract all entries that are not hardlinks (and possibly some hardlinks, depending on whether the content entry was already extracted).
2. After the first iteration we have a list of hardlinks and the path each should point to according to the tar, that is, a map from `target` -> `list of hardlinks`.
3. On the second iteration we find `target` in the tarball. We extract `target` to `list of hardlinks[0]`. Then we iterate over the remaining elements and create the links from `hardlinks[0]` to `hardlinks[1..]`.

As I said in 1., there is also the optimization that `target` might already be extracted, so there is no need for a second iteration. The downside is that we need to keep track of the names of the extracted files to know whether `target` was extracted. We already do that in the report, but exposing a good API for it might be a challenge, so we have to think about that.

In fact, because this is a "niche" use case that we do not anticipate happening a lot, and because the optimization mentioned in 1. is even more niche, I would say let's not worry about it unless it is easy to implement.
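The two-pass flow sketched above could look roughly like this (tar reading replaced by an in-memory slice; every name here is hypothetical, not from the PR):

```go
package main

import "fmt"

// entry is a simplified stand-in for a tar header: hard links carry
// the path of the file they link to in linkTo.
type entry struct {
	name   string
	linkTo string // non-empty for hard links
}

// extract sketches the proposed flow: the first pass extracts wanted
// regular entries and records hard links whose target was not (yet)
// extracted; the second pass runs only when such pending links exist.
func extract(entries []entry, wanted map[string]bool) []string {
	var plan []string
	extracted := make(map[string]bool)
	pending := make(map[string][]string) // target -> hard links to it
	for _, e := range entries {
		if !wanted[e.name] {
			continue
		}
		switch {
		case e.linkTo == "":
			plan = append(plan, "extract "+e.name)
			extracted[e.name] = true
		case extracted[e.linkTo]:
			plan = append(plan, "link "+e.name+" -> "+e.linkTo)
		default:
			pending[e.linkTo] = append(pending[e.linkTo], e.name)
		}
	}
	// Second pass, only needed if some hard link target was skipped.
	for _, e := range entries {
		links := pending[e.name]
		if len(links) == 0 {
			continue
		}
		// Extract the target's content under the first link's path,
		// then point the remaining links at it.
		plan = append(plan, "extract "+e.name+" as "+links[0])
		for _, l := range links[1:] {
			plan = append(plan, "link "+l+" -> "+links[0])
		}
	}
	return plan
}

func main() {
	entries := []entry{
		{name: "base"},
		{name: "l1", linkTo: "base"},
		{name: "l2", linkTo: "base"},
	}
	// "base" itself is not wanted, but its hard links are.
	for _, step := range extract(entries, map[string]bool{"l1": true, "l2": true}) {
		fmt.Println(step)
	}
}
```

In this sketch the second loop only does work when some hard link target was itself not extracted, which matches the optimization mentioned in 1.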