Add support for normalizing the archived files metadata #47

Conversation

@64kramsystem commented Jul 5, 2019

UPDATE: a few important notes if anybody happens to read this PR:

  1. This is a universal solution, which also covers a concept that is sometimes forgotten: different O/Ss can produce different compressed streams.
  2. This solution causes an incompatibility in some cases, which is solvable but needs extra logic, and certainly good testing (see Zip files are incompatible with Java's java.util.zip 64kramsystem/terraform-provider-archive-dev#1 (comment)).
  3. There are other, more ad hoc solutions applicable (e.g. granular options); they are easy to implement - as a matter of fact, this PR is a sort of blend of them all.
  4. I'm not interested in participating in this feature or in the related discussion, so please don't (manually) CC me; the maintainers have shown through facts (as opposed to words) a lack of interest in proceeding with any solution at all; it'd be polite (and good practice) if they just declared so, rather than rushing to say that this will be taken care of. For me, this is poor maintenance, and I don't want to be involved with it.

When archiving files from different machines, even if the source content is the same, different archive files may be generated due to a variety of factors, such as compression libraries or file metadata differences (permissions, etc.).

This creates problems in situations where the exact same archive is expected to always be generated from a given source content (typically, a Lambda function resource).

An example context that causes differences is Git on different O/Ss - filesystems may have different mount permission masks and still look identical to Git (which doesn't distinguish 0644 from 0664), but end up generating different archives.

The `normalize_files_metadata` option makes the archiver produce a "normalized" archive, which solves the problem. In the Zip case, this is obtained by setting:

  • the compression method to Store
  • the modification date to a fixed one
  • the permissions to 0644

Disabling compression is a tradeoff; however, in the contexts where this functionality is used (e.g. Lambda function resources), the difference is negligible.

Closes #34.


The functionality is simple and fits nicely with the code. The code itself doesn't have a very strict case convention for variable names (some are entirely lowercase without underscores, some are in camel case, etc.), so that may be subject to change.

The name of the option is entirely open to change. "Normalization" is the concept that I think fits, but of course, there are different ways of viewing it.

func normalizeCompressingFile(fh *zip.FileHeader) {
	fh.Method = zip.Store
	fh.SetModTime(time.Date(1981, 4, 10, 0, 0, 0, 0, time.UTC))
	fh.SetMode(0644)
}
@bboe commented Jul 5, 2019

@64kramsystem (Author) commented Jul 12, 2019

I'm not familiar with executing binaries inside a Lambda function; based on brief research, Lambda may reset the permissions of the archived files - see, for example:

It's not 100% clear from those resources where their problem is - it could be the zip library not storing the file permissions (not all do), or Lambda resetting the permissions when unarchiving; if it's the latter, managing the executable bit would have no effect.

Do you have direct experience with this use case, so that I know with certainty what the exact behavior is?

@64kramsystem force-pushed the sav-normalize_files_metadata_option branch from 0bfacd0 to 347a1af on July 12, 2019 09:46
@64kramsystem (Author) commented Jul 12, 2019

Removed the Travis go_import_path setting commit. It was necessary to make the build run on the branch, but it's no longer needed now that the PR has been opened.

@sveniu commented Oct 10, 2019

Did this fizzle out? I'm eagerly awaiting an update on this or #41.

I'm using a pretty ugly workaround for now, to unify the umask across platforms:

git () {
  origin="$(command git remote get-url --push origin 2>/dev/null)"
  [ "$origin" = "[email protected]:foo/terraform.git" ] &&
    (umask 0022; command git "$@") ||
    command git "$@"
}

@64kramsystem (Author)

> Did this fizzle out? I'm eagerly awaiting an update on this or #41.
>
> I'm using a pretty ugly workaround for now, to unify the umask across platforms:

Hello! Sadly, the upstream maintainer(s) have entirely ignored both the issue and the PR - the PR is long-standing, and hasn't been commented on, closed, or accepted, while there's activity on master.

We've been using this archive fork at my company successfully for a long time now, so the best strategy is likely to officially release the fork.

I'll close this PR, then give updates on the fork.

@64kramsystem (Author)

Closing for inactivity.

@callaa3 commented Jan 9, 2020

It's been an issue for a while for us using 3 different OSs in the office. Any chance we can get some visibility back on the issue?

@64kramsystem (Author)

> It's been an issue for a while for us using 3 different OSs in the office. Any chance we can get some visibility back on the issue?

I think the project is essentially unmaintained (and when it wasn't, this PR was entirely ignored, which is "not cool" 🙄).
I'd like to produce and publish my fork, but I don't have the resources right now.

Regardless, something important to keep in mind for people needing this functionality is that the code doesn't need to be upstream. Just clone my fork and compile the plugin, then place it in the appropriate location - either the per-user plugin directory (see the manual) or the per-project TF configuration directory.

The fork is stable, at least for our use case - at my company, we've been using it for a long time, and we just keep the compiled binary in our project's TF configuration directory.

@WtfJoke commented Feb 17, 2020

@saveriomiroddi can't you leave the pull request open so it eventually gets merged or receives some comments from the maintainers?

Which fork do you mean? Yours, https://github.com/saveriomiroddi/terraform-provider-archive-dev, seems to be "outdated" as well?

EDIT: You probably refer to https://github.com/saveriomiroddi/terraform-provider-archive-dev/tree/sav-normalize_files_metadata_option

@64kramsystem (Author) commented Feb 21, 2020

> @saveriomiroddi can't you leave the pull request open so it eventually gets merged or receives some comments from the maintainers?

Hello!

So! I have a few considerations in relation to the PR closure.

First, and most important from the practical perspective: the changes applied by the PR are free 😄 This means everybody is 100% free to, say, fork the original repository, copy the commit, and open a new PR.

In this case the procedure is:

# fork the original repository from the Web UI (or the GitHub command-line tool)

git clone <forked_repository_address>
cd <local_repository_dir>

git remote add saverio https://github.com/saveriomiroddi/terraform-provider-archive-dev.git
git fetch --all
git checkout -b <feature_branch_name> saverio/sav-normalize_files_metadata_option
git push --set-upstream origin HEAD

Now you can create the PR from the GitHub Web UI (or command-line tool) 😄

Second: repository maintainers can re-open PRs; this happened with a contribution of mine to Bundler. It's crucial to understand that the important thing is that the issue remains open - the typical maintenance workflow uses issues for tracking purposes, and in this case, an issue exists and is open.

Now, regarding the ideological aspect.

I close PRs after a set time because I think it's disrespectful of maintainers to wholesale ignore conversations. I'm perfectly fine if an issue/PR is closed by the maintainers because:

  • it doesn't fit the overall design,
  • or there aren't enough resources to take full care of it,
  • or it's not deemed as needed,
  • or it's deemed as poorly engineered,
  • or the wind blows from north rather than south.

But I find ignoring contributions (conversations and/or PRs) a very poor maintenance practice, unless the project is expressly unmaintained.

Most important of all, this maintenance attitude implies that contributors must be thankful to the maintainers if their contributions are accepted.
I find this an extremely important topic. There should be a maintainers' code of conduct, not only the contributors' 😉.

Having said that, back to practice: Terraform plugins are trivial to build and use. In this case, it's just a matter of cloning the fork, building the plugin, and placing it in the appropriate directory. At my company, we keep those binaries in the SCM, so using them is friction-free. No need to chase careless maintainers 😄.

@64kramsystem (Author)

> Which fork do you mean? Yours, https://github.com/saveriomiroddi/terraform-provider-archive-dev, seems to be "outdated" as well?
>
> EDIT: You probably refer to https://github.com/saveriomiroddi/terraform-provider-archive-dev/tree/sav-normalize_files_metadata_option

I'm not sure what you mean by this; we've been using a compiled version of that branch/PR since it was published, until now - and we're currently on Terraform v0.12.19 without any problems, so I think this should work for everybody.

@WtfJoke commented Mar 6, 2020

> First, and most important from the practical perspective: the changes applied by the PR are free 😄 This means everybody is 100% free to, say, fork the original repository, copy the commit, and open a new PR.
>
> In this case the procedure is:

Sure, but I think it's not a nice gesture to copy the commit and open a new PR.
And I have contributed in the past too, so I'm capable of creating a PR using git, but thank you for the details 😉

> Second: repository maintainers can re-open PRs; this happened with a contribution of mine to Bundler. It's crucial to understand that the important thing is that the issue remains open - the typical maintenance workflow uses issues for tracking purposes, and in this case, an issue exists and is open.

I see your point, but in my opinion a PR can stay open as long as it's relevant and up to date enough to get merged. But I know now why you closed it; that's OK for me as an explanation :)

> Now, regarding the ideological aspect.

> I close PRs after a set time because I think it's disrespectful of maintainers to wholesale ignore conversations. I'm perfectly fine if an issue/PR is closed by the maintainers because:

You're a bit harsh with that opinion; a maintainer's life is hard, I think.
I agree that contributions in any form shouldn't be ignored completely; at the very least, leaving a comment should be the minimal effort from a maintainer.

> Having said that, back to practice: Terraform plugins are trivial to build and use. In this case, it's just a matter of cloning the fork, building the plugin, and placing it in the appropriate directory. At my company, we keep those binaries in the SCM, so using them is friction-free. No need to chase careless maintainers 😄.

But it's better to keep changes upstream, and forks tend to become outdated very fast. I'm aware, though, that it's trivial to build and use.

> I find this an extremely important topic. There should be a maintainers' code of conduct, not only the contributors' 😉.

In my opinion maintainers are also contributors and should follow these principles. But maybe you're right and there needs to be an extended version aimed at maintainers 😄

> I'm not sure what you mean by this; we've been using a compiled version of that branch/PR since it was published, until now - and we're currently on Terraform v0.12.19 without any problems, so I think this should work for everybody.

I just wasn't sure which fork/branch you were referring to, but I figured it out. So everything is fine :) 👍

Thanks for your detailed answer. I got your overall opinion and understand it.
I want to avoid getting too far off topic here, in case your PR gets reopened (which would be a great thing).

@64kramsystem (Author)

> Sure, but I think it's not a nice gesture to copy the commit and open a new PR.
> And I have contributed in the past too, so I'm capable of creating a PR using git, but thank you for the details

Sorry if that sounded patronizing - part of my job is to write scripts, so it's kind of a reflex 😬.

Regarding the discussion as a whole: somebody, on the referenced issue, pinged the maintainers as you were suggesting, which I think is the best idea - hopefully they'll at least participate in the discussion 🙂

To clarify, I'm a maintainer as well (of other projects) 😬, so I understand the problems 🙂

@appilon (Contributor) commented May 12, 2020

Hello @saveriomiroddi. Our sincerest apologies for not engaging sooner on the longstanding #34 and for not addressing the proposed fix here; we sincerely appreciate your contributions to the Terraform ecosystem.

With regard to this PR: after reviewing it, I do think it would have been hard to approve a generic "normalization" formula. What this PR does reveal, though, are the areas of an archive that can create differences in the computed hash output. So I think we will look to open up things like file mode, compression algorithm, and modification date as configurable options to help address problems such as #34.

@64kramsystem (Author)

> Hello @saveriomiroddi. Our sincerest apologies for not engaging sooner on the longstanding #34 and for not addressing the proposed fix here; we sincerely appreciate your contributions to the Terraform ecosystem.
>
> With regard to this PR: after reviewing it, I do think it would have been hard to approve a generic "normalization" formula. What this PR does reveal, though, are the areas of an archive that can create differences in the computed hash output. So I think we will look to open up things like file mode, compression algorithm, and modification date as configurable options to help address problems such as #34.

Hello! Thanks for the attention. It makes sense to make the parameters granular and available.

Ultimately, achieving the objective can be "not complex" (of course, with the assumption that our tested use cases cover a large portion of, or hopefully all, the potential differences between archives), since the parameters involved are only three (compression, modification time, and permissions).
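
For illustration, a granular variant along those lines could look roughly like the sketch below. The NormalizeOptions type, its fields, and applyNormalization are hypothetical names, not the provider's actual API, and the Terraform schema wiring is omitted; the sketch only shows how the three parameters could be applied independently to a zip.FileHeader:

package archiver // hypothetical package, for illustration only

import (
	"archive/zip"
	"os"
	"time"
)

// NormalizeOptions is a hypothetical granular form of the single
// normalize_files_metadata flag: each of the three sources of archive
// differences can be pinned independently.
type NormalizeOptions struct {
	ForceStore   bool         // disable compression (use zip.Store)
	FixedModTime *time.Time   // pin the modification time when non-nil
	FixedMode    *os.FileMode // pin the permissions (e.g. 0644) when non-nil
}

// applyNormalization mutates the entry header according to whichever
// options are set, leaving the other metadata untouched.
func applyNormalization(fh *zip.FileHeader, opts NormalizeOptions) {
	if opts.ForceStore {
		fh.Method = zip.Store
	}
	if opts.FixedModTime != nil {
		fh.SetModTime(*opts.FixedModTime)
	}
	if opts.FixedMode != nil {
		fh.SetMode(*opts.FixedMode)
	}
}

Setting all three options reproduces the behavior of this PR's normalizeCompressingFile, while setting only some of them would address narrower cases (e.g. pinning permissions alone for the Git/umask issue) without giving up compression.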

// normalizeCompressingFile normalizes the entry metadata by setting:
//
// - the Store compression method;
// - a fixed modification date;
// - fixed file permissions.
//
func normalizeCompressingFile(fh *zip.FileHeader) {
	fh.Method = zip.Store
	fh.SetModTime(time.Date(1981, 4, 10, 0, 0, 0, 0, time.UTC))
	fh.SetMode(0644)
}

This causes a very sneaky compatibility issue with Java's standard java.util.zip, resulting in:

java.util.zip.ZipException: only DEFLATED entries can have EXT descriptor
        at java.base/java.util.zip.ZipInputStream.readLOC(ZipInputStream.java:311)
        at java.base/java.util.zip.ZipInputStream.getNextEntry(ZipInputStream.java:123)
        at com.meh.test.UnzipFiles.unzip(UnzipFiles.java:30)
        at com.meh.test.UnzipFiles.main(UnzipFiles.java:17)

The quick summary seems to be: Go's zip writer streams, meaning it first writes the file entry body, and then writes a so-called "ext" data descriptor with the size information. java.util.zip can extract such entries if they use DEFLATE, since DEFLATE encodes the data in a way that lets the reader know when it has received all of it. STORE, on the other hand, encodes no such information, and Java fails to handle it.
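
For what it's worth, one possible way around this specific interaction - untested against Java here, and relying on zip.Writer.CreateRaw, which was added in Go 1.17, far newer than what this PR targets - is to precompute the CRC-32 and sizes so the writer can record them in the local file header instead of emitting the "ext" data descriptor. A minimal sketch:

package main

import (
	"archive/zip"
	"hash/crc32"
	"os"
	"time"
)

// addStoredEntry writes data uncompressed, with the CRC-32 and sizes
// computed up front; CreateRaw can then record them in the local file
// header, so no "ext" data descriptor should be needed for the entry.
func addStoredEntry(zw *zip.Writer, name string, data []byte) error {
	fh := &zip.FileHeader{
		Name:               name,
		Method:             zip.Store,
		CRC32:              crc32.ChecksumIEEE(data),
		CompressedSize64:   uint64(len(data)), // with Store, compressed == raw
		UncompressedSize64: uint64(len(data)),
		Modified:           time.Date(1981, 4, 10, 0, 0, 0, 0, time.UTC),
	}
	fh.SetMode(0644)

	w, err := zw.CreateRaw(fh) // raw writer: we supply the stored bytes as-is
	if err != nil {
		return err
	}
	_, err = w.Write(data)
	return err
}

func main() {
	f, err := os.Create("normalized.zip")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	zw := zip.NewWriter(f)
	// "taskdef.json" is just an illustrative entry name.
	if err := addStoredEntry(zw, "taskdef.json", []byte("{}\n")); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
}

Whether this actually satisfies java.util.zip would need verification, but it targets exactly the failure mode described above: a STORE entry whose sizes are only available from the trailing descriptor.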

Details:

Impact case: The Amazon ECS (Blue/Green) deploy provider of AWS CodePipeline raises an exception if a normalized zip file is fed to it:

Exception while trying to read the task definition artifact file from: <source stage>

We use archive_file to produce the zip, containing appspec.yaml and taskdef.json files for CodeDeploy and ECS, respectively.

AWS services seem to be heavily Java-based, so while we don't see the full Java exception output, it seems likely that this is the problem.


I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 27, 2024
Successfully merging this pull request may close these issues.

archive_file produces different results on different OSs