archive/tar: output is nondeterministic #12358
Comments
CL https://golang.org/cl/13975 mentions this issue. |
Replacing that with zero just to get deterministic output is a poor move. Deterministic output is nice, but the first goal of It does seem reasonable to provide a way to override that default, though. A new field in the |
How would setting the pid field to zero cause the generated archive to be any less valid? You cannot rely on PIDs being unique or having any particular value, so this field is already a completely arbitrary number. More importantly, the POSIX document is a specification for the behavior of the pax(1) command. I don't see how defaults for that command are even relevant to this package. |
Now that I've read The POSIX doc for Why does |
I think you're right that the only documentation for pax is the man page in the POSIX spec! I was curious so I googled around until I found something resembling an explanation for including the process ID in the header name. Here's what I found in [ftp://std.dkuug.dk/jtc1/sc22/def/n3511.pdf]:
1916 Rationale: Previously there was no way for the implementation
I don't understand the problem, much less how the process ID of the pax process which /created/ the archive is the solution. As someone pointed out on golang-dev, what actually got added was an /option/ for the user of pax to include the process ID in the name, as well as a default setting which includes it:
1896 On Page: 701 Line: 27030-27065 Section: pax
As a user of POSIX pax, you're free to set that format string to whatever you like. How could the extracting process possibly rely on it then?
I took a quick peek at GNU tar and BSD tar. GNU tar respects POSIX and includes the flag and its default %d/PaxHeaders.%p/%f when creating a header, but doesn't otherwise touch the value in exthdr.name. BSD tar doesn't even provide a flag and just puts the equivalent of %d/PaxHeaders/%f into exthdr.name. It, too, never looks at that field when reading an archive. The BSD tar man page points out the pax interchange format is a valid ustar archive, and
"older implementations that do not fully support these
It appears that the exthdr.name field's only purpose is to look like an ordinary ustar header so that it will be extracted as a file/directory by older tools with a name that obviously suggests something went wrong.
I have to side with the BSD guys here and suggest dropping the process ID entirely from the name. I don't think it matters what we put there, since it doesn't look like any tool out there examines that field, much less parses it. I think dir/PaxHeaders/file is a reasonable choice of what to put there. Older tools can read it and treat it like an ordinary file. Tools that know about PAX will completely ignore it. |
As mentioned above, the PID is used only for the benefit of extractors that do not support PAX. Since the tar spec says that unknown type flags must be treated as regular files, the TypeXHeader special files used by PAX will be dropped as regular files, with the contents of the PAX header "file" in a human-readable format. This is what would hypothetically happen when extracting two conflicting archives back-to-back using a dumb tar that doesn't support pax:
The PID is a crude method of separating the header files dropped by multiple invocations of an extraction whose files may collide. It does allow the PaxHeaders from different invocations to be identified, but it's not a great solution, since tar extractors typically overwrite previously dropped files anyway. Furthermore, the PAX specification does not even mandate that the value be of the form "%d/PaxHeaders.%p/%f"; it just defaults to that. Thus, it is up to the implementation to choose whatever value it wants. On top of that, the format chosen by PAX is problematic when the directory name is longer than 100 characters: in that situation the "PaxHeader.*" string will be truncated off. My vote is to hardcode the PID to 0. |
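To make the default template discussed above concrete, here is a minimal sketch of how "%d/PaxHeaders.%p/%f" expands for a given file name. The helper `paxHeaderName` is hypothetical and written for illustration only; it is not the archive/tar API, and it assumes that `%d` is the directory part of the name, `%p` the process ID, and `%f` the base name.

```go
package main

import (
	"path"
	"strconv"
)

// paxHeaderName is a hypothetical helper (not part of archive/tar).
// It expands the POSIX default template "%d/PaxHeaders.%p/%f":
// directory of the entry, "PaxHeaders." plus a process ID, base name.
func paxHeaderName(name string, pid int) string {
	dir, file := path.Split(name)
	return path.Join(dir, "PaxHeaders."+strconv.Itoa(pid), file)
}
```

Calling `paxHeaderName("a/b/file.txt", os.Getpid())` yields a different string in each process, which is the source of the nondeterminism; passing a constant 0, as proposed in this thread, gives `a/b/PaxHeaders.0/file.txt` every time.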
@dsymonds Can you comment on this issue? Deterministic file generation is useful even for unit tests for the tar package itself. I noticed that the package is missing tests to lock in correct PAX file generation. |
I think hard coding 0 is fine. |
To expand my previous comment, generating deterministic output, especially in code conceivably used in builds, trumps whatever old rationales might have been presented by POSIX for this or that behavior. For much the same reason, we store a zero time stamp in Windows PE executables when generating them in the linker, despite this being not what one is Supposed To Do. |
A program that uses archive/tar doesn't produce identical output even for identical inputs because writePAXHeader encodes the current process ID into the output.
The reason the code does this is that POSIX.1 suggests using this behavior by default. See http://pubs.opengroup.org/onlinepubs/009695299/utilities/pax.html, exthdr.name. That might be a reasonable thing for a command-line 'tar' utility, and it's intended only as a default, but it's a horrible hardwired behavior for an archive package's API.
I think we should replace this field with zero.
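One way to check for this kind of problem is to hash the bytes that archive/tar produces for a fixed input. The sketch below is an assumption-laden illustration, not a test from the Go tree: `buildArchive` and `archiveDigest` are hypothetical names, and the entry name is deliberately made too long for the ustar name and prefix fields so that the writer has to emit a PAX extended header.

```go
package main

import (
	"archive/tar"
	"bytes"
	"crypto/sha256"
	"strings"
)

// buildArchive writes a one-entry archive. The name is over 400 bytes,
// too long for ustar, so the writer must fall back to a PAX header.
func buildArchive() ([]byte, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	body := []byte("hello\n")
	hdr := &tar.Header{
		Name:     strings.Repeat("long-directory-name/", 20) + "file.txt",
		Mode:     0644,
		Size:     int64(len(body)),
		Typeflag: tar.TypeReg,
	}
	if err := tw.WriteHeader(hdr); err != nil {
		return nil, err
	}
	if _, err := tw.Write(body); err != nil {
		return nil, err
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// archiveDigest returns the SHA-256 of the archive bytes, so that two
// runs (ideally in two separate processes) can be compared.
func archiveDigest() ([sha256.Size]byte, error) {
	data, err := buildArchive()
	if err != nil {
		return [sha256.Size]byte{}, err
	}
	return sha256.Sum256(data), nil
}
```

Note that two calls inside the same process share a PID, so they would agree even with the behavior described in this issue. To observe the problem, the digest has to be printed from two separate runs of the program and compared.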