Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Toolchain: Build pretty. Use xz, not gz. Use sha256, not md5 #10449

Closed
wants to merge 3 commits into from

Conversation

bleys
Copy link

@bleys bleys commented Oct 12, 2021

We were already using lzma in the clang build, but not specifying it as a dependency. I fixed the dependency specs and switched the downloads in BuildIt.sh to xz instead of gz. The files are almost half the size. Also added a note about potential slowness of the GNU servers and an alternative (I started this edit because my initial download took nearly an hour off the GNU server).

It is safer to use sha256 than md5. Even sha1 is broken now: NixOS/nixpkgs#77238

I fully integrated the buildstep helper into the BuildIt.sh output. Also, printf = better echo (easier to use formatting and doesn't fork another process). Output formatting made so it is easier to follow the build process in general.

@BuggieBot
Copy link
Member

One or more of the commits in this PR do not match the code submission policy, please check the lint_commits CI job for more details.

@bleys bleys force-pushed the build-cleanup branch 3 times, most recently from 6b4834a to 6b0f482 Compare October 12, 2021 17:13
@bleys bleys changed the title Build cleanup: use lzma/xz not gz, sha256 not md5 Toolchain: Build pretty. Use xz, not gz. Use sha256, not md5 Oct 12, 2021
@ADKaster
Copy link
Member

ADKaster commented Oct 12, 2021

One thing to point out about the use of md5. It's not for any security purpose, just a sanity check.

Also sorry about the close/re-open, keyboard slipped on mobile (my bad!)

@ADKaster ADKaster closed this Oct 12, 2021
@ADKaster ADKaster reopened this Oct 12, 2021
@timschumi
Copy link
Member

timschumi commented Oct 12, 2021

In my opinion this could/should rather be split into three commits, since build output doesn't have much to do with the compression algorithm, and the compression algorithm doesn't have much to do with the hash algorithm. It would also make it easier to review and bisect down the line.

@bleys
Copy link
Author

bleys commented Oct 12, 2021

One thing to point out about the use of md5. It's not for any security purpose, just a sanity check.

Thanks for the review. These checksums serve two purposes:

  1. They say, "This file is intact."
  2. They say, "This is the file I intended to put here."

For the second purpose, that relies on the checksum algorithm being used not being easily deliberately confused. MD5 collisions are cheap. Bad actors could takeover any server or MITM you, and serve a file with an equal checksum that has a malicious payload. Same could happen with SHA1 since a couple of years ago. SHA256 avoids this for the foreseeable future. There are some good links in that NixOS issue I linked at top (where they migrated from SHA1 to SHA256 for verifying downloaded packages just like we are doing here, but for their whole giant Linux package repository).

@bleys
Copy link
Author

bleys commented Oct 13, 2021

In my opinion this could/should rather be split into three commits, since build output doesn't have much to do with the compression algorithm, and the compression algorithm doesn't have much to do with the hash algorithm. It would also make it easier to review and bisect down the line.

@timschumi Yes, agreed. I have done that work now. Thanks!

@timschumi
Copy link
Member

Is there a particular reason for replacing all the echo with printf? To my understanding, both are equivalent if you don't make use of the formatting capabilities.

@bleys
Copy link
Author

bleys commented Oct 13, 2021

Yes:

  1. I am using the formatting.
  2. printf avoids forking a new process, so it is more efficient.

@gmta
Copy link
Member

gmta commented Oct 13, 2021

In addition to the MD5 vs. SHA-256 discussion: (most) people are blindly trusting the scripts to download, compile and execute the resulting binaries on their systems, so making sure the attack plane for a malicious actor is as small as possible is really worthwhile here.

@timschumi
Copy link
Member

timschumi commented Oct 13, 2021

  1. I am using the formatting.

The only places where I can see formatting being used is here, while all the other printf calls still use variables directly. At that point I'd suggest converting all of them or none of them.

@bleys
Copy link
Author

bleys commented Oct 14, 2021

In addition to the MD5 vs. SHA-256 discussion: (most) people are blindly trusting the scripts to download, compile and execute the resulting binaries on their systems, so making sure the attack plane for a malicious actor is as small as possible is really worthwhile here.

Yes, exactly. It seems in other places in Toolchain we are already using SHA256 checks (e.g., BuildPython ) so really BuildIt is the straggler anyway. This just makes it all consistent.

Re: gz vs xz. LZMA (xz) seems like a pretty universal win. XZ itself is a tiny package to require (and indeed we already need it for the Clang build) and the file size savings are significant. Many packagers prefer and default to XZ for good reason.

It looks like a good portion of the packages in Ports also use xz, so it's good to get it into our build dependencies for that reason as well. Another good thing to do might be to go in and patch all the packages in Ports which are currently using gz and change them to pull xz files where available.

➜ grep "tar\.gz" ./serenity/Ports/*/*.sh | wc -l
135
➜ grep "tar\.xz" ./serenity/Ports/*/*.sh | wc -l
30

If you are curious about compression algorithms as they relate to packages, this is a good read: Linux OS data compression options: Comparing behavior

A quick summary of that: You should prefer XZ for high compression ratio, and ZSTD for quickness (great for things like on-the-fly compression, but not so relevant for archives). zlib/gzip is antiquated and almost universally under-performs on all dimensions for software packaging purposes.

@bleys
Copy link
Author

bleys commented Oct 14, 2021

While I'm thinking about this: Another patch for Ports might be to not assume we have access to GNU tar's compression detection here: Ports/.port_include.sh

Relevant: https://unix.stackexchange.com/questions/101561/what-are-the-differences-between-bsdtar-and-gnu-tar

@bleys
Copy link
Author

bleys commented Oct 15, 2021

The only places where I can see formatting being used is here, while all the other printf calls still use variables directly. At that point I'd suggest converting all of them or none of them.

I converted the rest of the printf variable inclusions.

@stale
Copy link

stale bot commented Nov 5, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions!

@stale stale bot added the stale label Nov 5, 2021
@stale
Copy link

stale bot commented Nov 12, 2021

This pull request has been closed because it has not had recent activity. Feel free to re-open if you wish to still contribute these changes. Thank you for your contributions!

@stale stale bot closed this Nov 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants