Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate file paths not accepted by pkg_tar #849

Closed
eejayes opened this issue Apr 11, 2024 · 6 comments
Closed

Duplicate file paths not accepted by pkg_tar #849

eejayes opened this issue Apr 11, 2024 · 6 comments

Comments

@eejayes
Copy link
Contributor

eejayes commented Apr 11, 2024

When tar archives which contain duplicate file paths are assigned to deps of pkg_tar, the function removes all but the first occurring instance, and warns with "Duplicate file in archive picking first occurrence". This contrasts with the common GNU tar utility, where "tar allows you to have infinite number of files with the same name" according to document for --append under The Five Advanced tar Operations, where unpacking of the archive will yield the last occurring instance of the file path.

My use case involves doing tar archive augmentation through a series of build steps, where one of these operations is to overwrite files of an existing archive. It is not an option to unpack the archives, one over the other, and then repack the result, because new entries for parent directories of the files in the incoming archives will be created which state their permissions. This is because the state of the directories on the extraction target must be preserved, rather than set to those which were present in the environment where the augmentation took place. So if an archive with duplicate files is presented to pkg_tar where the one most recently added is meant to overwrite, it is actually the original, or first file, which is retained, and thus when unpacking the overwriting behavior is not achieved.

I feel that pkg_tar's behavior in this case should be compatible with GNU tar, in order to support interoperability and leverage from common understanding that it has probably established.

eejayes pushed a commit to eejayes/rules_pkg that referenced this issue Apr 11, 2024
Duplicate path entries are made possible within tar archives, as per
feature request described in bazelbuild#849.

RELNOTES: Duplicate path entries supported within tar archives
eejayes pushed a commit to eejayes/rules_pkg that referenced this issue Apr 11, 2024
Duplicate path entries are made possible within tar archives, as per
feature request described in bazelbuild#849.

RELNOTES: Duplicate path entries supported within tar archives
eejayes pushed a commit to eejayes/rules_pkg that referenced this issue Apr 11, 2024
Duplicate path entries are made possible within tar archives, as per
feature request described in bazelbuild#849.

RELNOTES: Duplicate path entry support within tar archives
@aiuto
Copy link
Collaborator

aiuto commented Apr 11, 2024

Compatibliy with gnutar is not a goal we have been aiming for. We've done a lot of work to explicitly detect duplicate files because they are generally a sign that you didn't lay out the inputs correctly. Can you provide a more concrete example of what you are trying to with laying down multiple tar balls?
Also, can you explain how the allow_duplicates_with_different_content attribute does not address your use case.

@eejayes
Copy link
Contributor Author

eejayes commented Apr 12, 2024

What I am trying to do is combine two tar files, where one of them is regarded as having overwrite priority for all path conflicts. Before going into an example, my understanding of allow_duplicates_with_different_content is that the duplication corresponds to that occurring across dependencies, where the use case I consider is duplication within a pre-existing (dependency) archive. Also given that it's true default means that this option is being exercised when the problem occurs, it does not seem to be a solution. Also given the document, it seems like this is a deprecated interface. However I have not analyzed it's behavior, and could be missing something.

I will try to illustrate the problem here. For some diversity in tar utilities, I also exemplify bsdtars behavior.

Example "conflicting" input archives

$ tar tvf archive_a.tar
drwxr-xr-x s0000184/linuxusers 0 2024-04-12 08:30 ./
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:30 ./jazz
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:30 ./rock
$ tar tvf archive_b.tar
drwxr-xr-x s0000184/linuxusers 0 2024-04-12 08:34 ./
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:34 ./rock
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:34 ./blues

Concatenate with gnutar

$ tar --concatenate --file=archive_b.tar archive_a.tar
$ tar tvf archive_b.tar
drwxr-xr-x s0000184/linuxusers 0 2024-04-12 08:34 ./
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:34 ./rock
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:34 ./blues
drwxr-xr-x s0000184/linuxusers 0 2024-04-12 08:30 ./
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:30 ./jazz
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:30 ./rock

Propagate concatenated archive through pkg_tar

BUILD

load("@rules_pkg//pkg:mappings.bzl", "pkg_files")
load("@rules_pkg//pkg:pkg.bzl", "pkg_tar")

pkg_files(
    name = "gnutar_concatenated_archive",
    srcs = [":archive_b.tar"],
)

pkg_tar(
    name = "duplicate_file_demo",
    deps = [":gnutar_concatenated_archive"],
)

Commands

$ bazel build //:duplicate_file_demo
$ tar tvf bazel-bin/duplicate_file_demo.tar
-rw-r--r-- 1283467350/1283478178 0 2024-04-12 08:34 ./rock
-rw-r--r-- 1283467350/1283478178 0 2024-04-12 08:34 ./blues
-rw-r--r-- 1283467350/1283478178 0 2024-04-12 08:30 ./jazz

Compare unpacked results of gnutar and pkg_tar

$ mkdir inspect_pkgtar_output inspect_gnutar_output
$ tar xvf archive_b.tar -C inspect_gnu_tar_output/
$ tar xvf duplicate_file_demo.tar -C inspect_pkg_tar_output/
$ ls -l inspect_gnu_tar_output/
total 8
drwxr-xr-x 2 s0000184 linuxusers 4096 Apr 12 08:30 ./
drwxr-xr-x 4 s0000184 linuxusers 4096 Apr 12 09:20 ../
-rw-r--r-- 1 s0000184 linuxusers    0 Apr 12 08:34 blues
-rw-r--r-- 1 s0000184 linuxusers    0 Apr 12 08:30 jazz
-rw-r--r-- 1 s0000184 linuxusers    0 Apr 12 08:30 rock
$ ls -l inspect_pkg_tar_output/
total 8
drwxr-xr-x 2 s0000184 linuxusers 4096 Apr 12 09:21 ./
drwxr-xr-x 4 s0000184 linuxusers 4096 Apr 12 09:20 ../
-rw-r--r-- 1 s0000184 linuxusers    0 Apr 12 08:34 blues
-rw-r--r-- 1 s0000184 linuxusers    0 Apr 12 08:30 jazz
-rw-r--r-- 1 s0000184 linuxusers    0 Apr 12 08:34 rock

When unpacking the archive created by tar --concatenate, for paths that are subject to conflict it is the last file, or most recently added file which prevails. This is what the overwrite use case requires. When unpacking the archive created by pkg_tar, the first file in the archive, or in other words the oldest prevails.

bsdtar concatenate and unpack behavior

While bsdtar is different from gnutar including it's CLI, the behavior is roughly the same. The same incoming archives are used in this example.

$ bsdtar c --file b_first.tar @archive_b.tar @archive_a.tar
$ tar tvf b_first.tar 
drwxr-xr-x s0000184/linuxusers 0 2024-04-12 08:34 ./
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:34 ./rock
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:34 ./blues
drwxr-xr-x s0000184/linuxusers 0 2024-04-12 08:30 ./
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:30 ./jazz
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:30 ./rock
$ bsdtar c --file a_first.tar @archive_a.tar @archive_b.tar
$ tar tvf a_first.tar 
drwxr-xr-x s0000184/linuxusers 0 2024-04-12 08:30 ./
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:30 ./jazz
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:30 ./rock
drwxr-xr-x s0000184/linuxusers 0 2024-04-12 08:34 ./
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:34 ./rock
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:34 ./blues

$ mkdir extract_afirst extract_bfirst

$ bsdtar -x --file=b_first.tar -C extract_bfirst
$ ls -l extract_bfirst
total 0
-rw-r--r-- 1 s0000184 linuxusers 0 Apr 12 08:34 blues
-rw-r--r-- 1 s0000184 linuxusers 0 Apr 12 08:30 jazz
-rw-r--r-- 1 s0000184 linuxusers 0 Apr 12 08:30 rock

$ bsdtar -x --file=a_first.tar -C extract_afirst
$ ls -l extract_afirst/
total 0
-rw-r--r-- 1 s0000184 linuxusers 0 Apr 12 08:34 blues
-rw-r--r-- 1 s0000184 linuxusers 0 Apr 12 08:30 jazz
-rw-r--r-- 1 s0000184 linuxusers 0 Apr 12 08:34 rock

@aiuto
Copy link
Collaborator

aiuto commented Apr 12, 2024

Thanks for those examples. How about this proposed API

Add allow_dups_from_deps as an attribute to pkg_tar.

  • It works like concatenate from gnutar.
  • The comment should mention that deps are processed in order, so users might need the buildifer no-sort-list option to prevent it from reordering things.
  • default=False

With that behavior, all existing tests should pass without change. A new test could easily be added.
It would be best if you extended the verify_archive test with a content attribute. It would be a string keyed dict of path to content. Sort of like the verify_links parameter. Then 'rock' could have different content in the order or not.

A possible future idea is an option to allow deps processing before srcs (implying allow_dups_from_deps). That way you could start with a baseline .tar from somewhere else and overlay your own srcs on top of it. WDYT?

FWIW.... this is the second time in a month where no-sort-list has come up in a context like this. buildozer needs fixing to shift that declaration to the rule author, not the end user. But this is beyond the scope here.

@eejayes
Copy link
Contributor Author

eejayes commented Apr 19, 2024

Yes that sounds like a good proposal, as it supports this new behavior and doesn't risk breaking the existing API. I don't know if it's a sensible use case for duplicates to be assigned to srcs (i.e. in the same pkg_tar instance). It's more likely that the contents change over time due to modifications. There your suggestion about processing deps first comes into play. This is also good.

FWIW, this behavior originates from the tape drive technology and specification, e.g.: tape archive duplicates.

@aiuto
Copy link
Collaborator

aiuto commented Apr 19, 2024

SGTM. Except my proposal about testing is not sufficient. You need to be able to specify that a file appears N times, and possibly what the content for each is.

FWIW, this behavior originates from the tape drive technology and specification, e.g.: tape archive duplicates.

Well.... if we want to get pedantic, I actually used tape drives before tar - and the pain of dealing with that never leave you. In those times, you usually had no file name at all on tapes. It was just data. More advanced schemes used a label at the beginning, specifying the name and sequence number (so you knew if it was tape 3 of a very large data set). The duplicate file name within the same physical unit was pretty much tar specific because most other backup schemes would rarely append into the existing set of files. Instead, they would skip to end of the written tape, and append a new set of data. That resulted in something like file1 EOF file2 EOF ... fileN EOF EOF fileX EOF, fileY EOF ... EOF EOF EOF. tar improved tape utilization by changing any single set of file EOF groups to a single file,file,file EOF group. You could put multiple tar outputs on a tape, but trying to back up over that last EOF and append into the previous one was a bit risky.

I like to remind people that MapReduce/Flume/Hadoops/Spark ... are not a recent invention. People dealt with data bigger than memory using tapes drives to read/process/sort/merge back in the 1950s.

eejayes pushed a commit to eejayes/rules_pkg that referenced this issue Apr 30, 2024
Duplicate path entries are made possible within tar archives as
discussed in feature request bazelbuild#849. This includes an interaction with
create parents, where the only logical scenario which would require
inference of a parent directory is when one does not already exist.
This is because allowance of duplicates is only useful when explicit
paths are declared.

RELNOTES: Duplicate path entries supported within tar archives
eejayes pushed a commit to eejayes/rules_pkg that referenced this issue Apr 30, 2024
Duplicate path entries are made possible within tar archives as
discussed in feature request bazelbuild#849. This includes an interaction with
create parents, where the only logical scenario which would require
inference of a parent directory is when one does not already exist.
This is because allowance of duplicates is only useful when explicit
paths are declared.

RELNOTES: Duplicate path entries supported within tar archives
eejayes pushed a commit to eejayes/rules_pkg that referenced this issue May 2, 2024
Duplicate path entries are made possible within tar archives as
discussed in feature request bazelbuild#849. This includes an interaction with
create parents, where the only logical scenario which would require
inference of a parent directory is when one does not already exist.
This is because allowance of duplicates is only useful when explicit
paths are declared.

RELNOTES: Duplicate path entries supported within tar archives
eejayes added a commit to eejayes/rules_pkg that referenced this issue May 2, 2024
Duplicate path entries are made possible within tar archives as
discussed in feature request bazelbuild#849. This includes an interaction with
create parents, where the only logical scenario which would require
inference of a parent directory is when one does not already exist.
This is because allowance of duplicates is only useful when explicit
paths are declared.

RELNOTES: Duplicate path entries supported within tar archives
eejayes added a commit to eejayes/rules_pkg that referenced this issue May 2, 2024
Duplicate path entries are made possible within tar archives as
discussed in feature request bazelbuild#849. This includes an interaction with
create parents, where the only logical scenario which would require
inference of a parent directory is when one does not already exist.
This is because allowance of duplicates is only useful when explicit
paths are declared.

RELNOTES: Duplicate path entries supported within tar archives
eejayes added a commit to eejayes/rules_pkg that referenced this issue May 2, 2024
Duplicate path entries are made possible within tar archives as
discussed in feature request bazelbuild#849. This includes an interaction with
create parents, where the only logical scenario which would require
inference of a parent directory is when one does not already exist.
This is because allowance of duplicates is only useful when explicit
paths are declared.

RELNOTES: Duplicate path entries supported within tar archives
aiuto pushed a commit that referenced this issue May 29, 2024
Duplicate path entries are made possible within tar archives as
discussed in feature request #849. This includes an interaction with
create parents, where the only logical scenario which would require
inference of a parent directory is when one does not already exist.
This is because allowance of duplicates is only useful when explicit
paths are declared.

RELNOTES: Duplicate path entries supported within tar archives
@aiuto
Copy link
Collaborator

aiuto commented May 31, 2024

Believed fixed by #850

@aiuto aiuto closed this as completed May 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants