Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in merged tars with big file names on Python 3.8 #204

Closed

Conversation

andrewalker
Copy link

@andrewalker andrewalker commented Jul 14, 2020

When adding tarballs as input for other tarballs (i.e. pkg_tar that depends on another pkg_tar), every file with a name that exceeds 98 characters will not land in the new root directory (package_dir attribute), but rather on the actual root.

Consider the following example:

pkg_tar(
    name = "foo",
    package_dir = "/foo",
    strip_prefix = ".",
    srcs = [ ... ],
    deps = [":big-names"],
)

pkg_tar(
    name = "big-names",
    strip_prefix = ".",
    srcs = [
        "ninety-eight-characters-seems-to-be-the-maximum-limit-for-merged-tarballs-in-the-rules_pkg-for-bzl",
        "ninety-eight-characters-seems-to-be-the-maximum-limit-for-merged-tarballs-in-the-rules_pkg-for-bzl-",
    ],
)

The foo tarball will have the following structure:

.
├── foo
│   └── ninety-eight-characters-seems-to-be-the-maximum-limit-for-merged-tarballs-in-the-rules_pkg-for-bzl
└── ninety-eight-characters-seems-to-be-the-maximum-limit-for-merged-tarballs-in-the-rules_pkg-for-bzl-

Filenames with 98 characters or less get included in the right directory, but the larger ones go to the root. A deep directory structure behaves the same way:

.
├── foo
│   └── ninety/eight/characters/seems/to/be/the/maximum/limit/for/merged/tarballs/in/the/rules_pkg/for/bzl
└── ninety/eight/characters/seems/to/be/the/maximum/limit/for/merged/tarballs/in/the/rules_pkg/for/bzl-

And the size of the package_dir has no influence: "/f", "/foo" or "/foo/bar/baz/quux" behaves exactly the same way.

This only happens on Python 3.8. The reason is the default format changed from tarfile.GNU_FORMAT to tarfile.PAX_FORMAT. It's also possible to reproduce this by changing the format in archive.py to PAX_FORMAT explicitly.

To fix it, I'm explicitly requesting GNU_FORMAT even on 3.8, so that it continues to work as before.

When adding tarballs as input for other tarballs (i.e. `pkg_tar` that
depends on another `pkg_tar`), every file with a name that exceeds 98
characters will not land in the new root directory (`package_dir`
attribute), but rather on the actual root.

Consider the following example:

    pkg_tar(
        name = "foo",
        package_dir = "/foo",
        strip_prefix = ".",
        srcs = [ ... ],
        deps = [":big-names"],
    )

    pkg_tar(
        name = "big-names",
        strip_prefix = ".",
        srcs = [
            "ninety-eight-characters-seems-to-be-the-maximum-limit-for-merged-tarballs-in-the-rules_pkg-for-bzl",
            "ninety-eight-characters-seems-to-be-the-maximum-limit-for-merged-tarballs-in-the-rules_pkg-for-bzl-",
        ],
    )

The foo tarball will have the following structure:

    .
    ├── foo
    │   └── ninety-eight-characters-seems-to-be-the-maximum-limit-for-merged-tarballs-in-the-rules_pkg-for-bzl
    └── ninety-eight-characters-seems-to-be-the-maximum-limit-for-merged-tarballs-in-the-rules_pkg-for-bzl-

Filenames with 98 characters or less get included in the right
directory, but the larger ones go to the root. A deep directory
structure behaves the same way:

    .
    ├── foo
    │   └── ninety/eight/characters/seems/to/be/the/maximum/limit/for/merged/tarballs/in/the/rules_pkg/for/bzl
    └── ninety/eight/characters/seems/to/be/the/maximum/limit/for/merged/tarballs/in/the/rules_pkg/for/bzl-

And the size of the package_dir has no influence: "/f", "/foo" or
"/foo/bar/baz/quux" behaves exactly the same way.
@andrewalker andrewalker requested a review from aiuto as a code owner July 14, 2020 11:25
@googlebot
Copy link

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@andrewalker
Copy link
Author

Huh, pipeline passed. I get a broken test on ArchLinux, bazel 3.3.0- (@non-git).

@andrewalker andrewalker marked this pull request as draft July 14, 2020 11:29
@andrewalker
Copy link
Author

It's also worth mentioning that if I change the format to GNU_FORMAT, I don't get the error:

diff --git a/pkg/archive.py b/pkg/archive.py
index f15b425..a2de674 100644
--- a/pkg/archive.py
+++ b/pkg/archive.py
@@ -146,7 +146,7 @@ class TarFileWriter(object):
       # Instead, we manually re-implement gzopen from tarfile.py and set mtime.
       self.fileobj = gzip.GzipFile(
           filename=name, mode='w', compresslevel=9, mtime=self.default_mtime)
-    self.tar = tarfile.open(name=name, mode=mode, fileobj=self.fileobj)
+    self.tar = tarfile.open(name=name, mode=mode, fileobj=self.fileobj, format=tarfile.GNU_FORMAT)
     self.members = set([])
     self.directories = set([])

@andrewalker
Copy link
Author

andrewalker commented Jul 14, 2020

OK, found the error. I'm running Python 3.8, and according to the docs: "Changed in version 3.8: The default format for new archives was changed to PAX_FORMAT from GNU_FORMAT". If I force format=tarfile.PAX_FORMAT on archive.py, it fails consistently everywhere. On my machine that's the default, which is why I can fix it by using GNU_FORMAT.

Up until Python 3.8, the default format used to be GNU_FORMAT. On 3.8,
it changed to PAX_FORMAT, which is supposed to also support big file
names. There seems to be a bug in it, exposed by the test in my previous
commit.

This commit ensures that even with Python >= 3.8 long file names still
work, by explicitly requesting GNU_FORMAT when writing.
@andrewalker andrewalker changed the title Provide broken test for bug in merged tars with big file names Fix bug in merged tars with big file names on Python 3.8 Jul 14, 2020
@andrewalker andrewalker marked this pull request as ready for review July 14, 2020 13:02
@aiuto
Copy link
Collaborator

aiuto commented Jul 14, 2020

See #203

@DhashS
Copy link

DhashS commented Aug 3, 2020

@andrewalker I think you need to sign the CLA to get this merged. We're looking forward to it, as this oneliner makes the behaviour of rules_pkg not break horribly depending on your python version

@aiuto
Copy link
Collaborator

aiuto commented Aug 7, 2020

Sorry I took so long. I wanted to fix the testing of debian packages first. One #211 goes in, I will look at this and #201 and see which to apply.

@andrewalker
Copy link
Author

Hey, sorry I haven't signed the CLA yet. I was going to sign it as part of my company (so the corporate version), but that got a bit delayed in internal bureaucracy. I see #217 was merged, is there still interest in merging the test?

@aiuto
Copy link
Collaborator

aiuto commented Aug 13, 2020

I would like to have some tests to prove behavior. I am behind on this myself. I want to understand what appears contradictory

  • python 3.8 switched a tarfile default to PAX on the grounds that it provides better UTF-8 and long file support
  • our experience with .deb files seems to disagree.

So, what I need is more tests beyond what you have in this CL. And, I don't have python3.8 in our test machines yet, so anything I do is a stab in the dark. Nor do I have a Windows machine in my home office, so some interoperbility there is hard to test. If you have no objections, I would take your test case and use it in a larger set of CLs.

@aiuto aiuto closed this Sep 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants