GitHub archive hash stability #46034

vtbassmatt · 2023-02-01T17:46:46Z

vtbassmatt
Feb 1, 2023

(Edit 2023-02-21) A blog post with our go-forward plan is now live. Doc updates will come soon. Thank you all for your comments and insights.

GitHub will hold the source downloads byte-for-byte stable for no less than a year from today (February 21, 2023). This covers both tarball (.tar.gz) and zipball (.zip) formats.
In the future, if we intend to change either archive format, we’ll provide six months’ notice in documentation, and on the blog and changelog. (If we discover a critical vulnerability in the compression path, we reserve the right to shorten or omit the notice period in order to protect our systems and our customers. We don’t expect this outcome, but you never know.)
We presently have no intent to change either format, as we have a new appreciation for the magnitude of the impact this change would have. In full transparency, there are a few deficiencies we wish we could fix (timestamps embedded in zipballs; dependency on system gzip for tarballs), but for the foreseeable future, we’ll engineer around these minor problems.

Hey folks, I'm the product manager for Git at GitHub. On January 30, 2023, we deployed a change from upstream Git which changed the compression library used by git archive. While the inner contents of these artifacts didn't change, the exact byte layout of the file did, which (obviously) changes any checksum or hash of the archive. This broke a number of ecosystems we didn't expect, so after y'all reported it, we rolled back the change.

I have a pretty good working knowledge of what you're likely using these hashes for. However, it's a lot more powerful if I can use your words and insights directly when I'm influencing changes at GitHub. So forgive me if some of these questions seem a little elementary — I'm trying to channel my "beginner's mind" and take nothing for granted. Can folks provide input on the following:

What community/ecosystem/project do you represent? If you're in the core committer group for that community and/or part of a security committee, please mention that as well.
Does your ecosystem or tool rely on the precise bytes in a source download† from GitHub? If yes, please describe the dependency and what it powers or allows.
Were you affected by the change on January 30th? If so, how did you notice it, what impact did it have, and has it driven any change in your thinking/policies/tools in the meantime?
Anything else you want me to know?

†These are the auto-generated tarballs/zipballs on the "Releases" page which don't have a filesize and say "Source code (zip)" or "Source code (tar.gz)". They have URLs of the form https://github.com/$OWNER/$REPO/archive/refs/tags/$TAG.zip (or .tar.gz). Files with a filesize are uploaded manually and are completely stable because GitHub doesn't touch them.

Being fully transparent: we had not intended to hold these source archives stable (in their byte layout / hash) forever. They're generated by (essentially) running git archive on-demand plus a little caching to avoid thundering herds. We expect to be subject to changes from upstream Git, compression libraries, system settings and headers, etc. On the other hand, we've never intended to make them change for any specific reason.

Intent aside, we've sent mixed signals to the community over the years. This is not the first time we've rolled back a change that would have altered the hashes. If we say "these aren't stable" but then spend a half decade keeping them stable, that's confusing.

This comment is a correct rendering of what a GitHub employee communicated in a support ticket. I've reviewed the entire ticket myself internally, and we (GitHub) had a communication breakdown between what engineering intended and what ultimately got shared with the customer. The names were essentially correct ("Repository release archives" are stable while "Repository code download archives" are not), but then we mistakenly put the /archive/refs/tags/$TAG.zip URL in the "stable" bucket. That's the URL for "code download archives", the ones we don't guarantee stability on. (And prior to that, we had in another communication been explicit that they are not guaranteed stable.)

The above ☝️ is not me telling you our plan going forward. It's an attempt at clarifying why this is even a topic for discussion based on what's happened in the past. This very discussion will help GitHub define, commit to, communicate, and ultimately execute on a plan for the future. That plan will include which hashes are stable and which are not (if any), including permanent documentation.

Please be civil, and don't get me punished for (or make me regret 😅) being this open.

Answered by tpudlik

Sep 6, 2024

I figured it out. It's STM's fault. Sorry about the noise here.

STM apparently renamed their repository, from https://github.com/STMicroelectronics/cmsis_core/ to https://github.com/STMicroelectronics/cmsis-core/. The links all redirect, but the top-level folder in the compressed archive now has a different name (cmsis-core, not cmsis_core), so the checksum changed, too.

View full answer

eli-schwartz · 2023-02-01T18:03:31Z

eli-schwartz
Feb 1, 2023

Related to this, I started a discussion on the 30th on the development list for git itself (not github of course): https://public-inbox.org/git/[email protected]/T/

The objective is, w.r.t.

We expect to be subject to changes from upstream Git, compression libraries, system settings and headers, etc. On the other hand, we've never intended to make them change for any specific reason.

to discuss ways for upstream git to offer an actual official guarantee of stability. This could end up meaning that github, in the long run, feels confident documenting that these source archives shall be stable.

0 replies

eli-schwartz · 2023-02-01T18:19:19Z

eli-schwartz
Feb 1, 2023

What community/ecosystem/project do you represent? If you're in the core committer group for that community and/or part of a security committee, please mention that as well.

I am a core committer @mesonbuild, a modern cross-platform build system for multi-language software, in this case the https://wrapdb.mesonbuild.com component (https://github.com/mesonbuild/wrapdb)

Does your ecosystem or tool rely on the precise bytes in a source download† from GitHub? If yes, please describe the dependency and what it powers or allows.

We depend on the exact bytes of source downloads from github, for some projects.

For other projects, we depend on the exact bytes of manually uploaded release artifacts, occasionally uploaded to github releases, occasionally uploaded to third-party file storage.

We need the exact bytes, because we hash files for security purposes before running tools such as tar to extract them, then execute the code inside in an automated fashion (specifically, to run their build systems). This enables embedding vendorized dependencies, useful for operating systems where dependencies are harder to acquire.

Were you affected by the change on January 30th? If so, how did you notice it, what impact did it have, and has it driven any change in your thinking/policies/tools in the meantime?

We were affected by the change. It resulted in a temporary outage for users, one of whom reported a bug against a specific source download. This led to confusion, and one person submitted a bug report to the project whose source download we were using, asking what's up.

We have no plans to change our thinking/policies/tools -- we are beholden to whether projects themselves manually upload artifacts, and most do not. In an ideal world, projects would have formalized release management processes including dedicated release artifacts, cryptographic authorship signatures (PGP), various odd extras such as precompiled manpages or generated autotools configure scripts, etc. etc. etc... but that's a lot of work, and github source downloads using git-archive backed endpoints are already there, so many projects do not bother. People (such as us) relying on those projects therefore use what is available.

0 replies

mboisson · 2023-02-01T18:19:40Z

mboisson
Feb 1, 2023

What community/ecosystem/project do you represent? If you're in the core committer group for that community and/or part of a security committee, please mention that as well.

I am part of the EasyBuild community, and team lead for research support for the Digital Research Alliance of Canada (the Alliance, for short).

Does your ecosystem or tool rely on the precise bytes in a source download† from GitHub? If yes, please describe the dependency and what it powers or allows.

The Alliance supports researchers from across Canada using digital research infrastructure, such as HPC clusters. As part of this function, we install research software on our infrastructure. Since a sine qua none condition of research is reproducibility, we require all scientific software installation to be scripted and reproducible. Since we have limited resources, performance of the compiled applications is key, and therefore we compile almost everything from source with optimizations. To do so, we use EasyBuild, and we require every package installed to have its checksum recorded and verified. This ensures that code installed today is the same which will be reinstalled (if needs be) 2 years from now.

In an ideal world, every single software package would actually do proper releases, with semantic versioning, release notes, with their own tarball, with their own published checksums. Unfortunately, in the world of dealing with research code, that is not going to happen. We therefore need a way to at least know that the code has not changed.

Were you affected by the change on January 30th? If so, how did you notice it, what impact did it have, and has it driven any change in your thinking/policies/tools in the meantime?

Yes, we were affected. Code which was compiling in a development environment a few hours before ended up not compiling when building in a pre-production environment a few hours later because the checksums verification had failed. That raised a lot of questions and made us wonder if we were victim of a supply chain/MitM attack.

Anything else you want me to know?

It is in fact not critical that checksums don't change. What is critical is that the end user can easily validate that the tarball being downloaded corresponds to the precise tag/commit/version of the code that is intended to be downloaded. It would in fact be better if there was an API/URL that we could call to verify the expected checksum of the file being downloaded. At the moment, the best we can do is download the file, compute its checksum, and if we download it again later, verify that it has not changed since the version we tested before. Even better would be to store and report all historical checksums of a given file, so that one can test whether the tarball downloaded a month ago corresponds to at least one of the checksums that were recorded back then.

3 replies

tgamblin Feb 2, 2023

@mboisson

It is in fact not critical that checksums don't change. What is critical is that the end user can easily validate that the tarball being downloaded corresponds to the precise tag/commit/version of the code that is intended to be downloaded. It would in fact be better if there was an API/URL that we could call to verify the expected checksum of the file being downloaded.

Using an API to get the current checksum, then immediately verifying with it, is not safe. The point of the checksum is to ensure that the file you are receiving is the same one that maintainers approved. It's a very strong way to verify that.

You are trusting two things when you download a tarball from GitHub:

GitHub (via cert from https download)
The project

With checksums, contributors or maintainers verify that the project provided a specific release at a specific time, and they record it in their package recipe, which is stored somewhere. When you download and verify an archive, you're verifying against that. The reason you do that is so that if the file is compromised over time, you can detect it. And yes, the initial contribution could have the checksum of a compromised archive, but the key assumption here is that most projects are not compromised most of the time. That's where the security comes from.

With the API model you propose, you are essentially trusting GitHub twice, and you're not verifying anything against the project at all. GitHub may not know that a malicious maintainer has taken over a project, and a Github API is going to tell you the current checksum for that tarball regardless of what has happened to the project.

Using this API approach invalidates the check -- You might as well not have a checksum at all, and just accept whatever Github sends you over https.

eli-schwartz Feb 2, 2023

Mostly.

The checksum does tell you one thing -- it tells you that you didn't suffer a network hiccup resulting in downloading truncated bytes, i.e. the type of error that would be responded to with "delete it and try again". Actually pretty unlikely to occur in practice, since there are other ways to verify the same thing.

And yes, the initial contribution could have the checksum of a compromised archive, but the key assumption here is that most projects are not compromised most of the time. That's where the security comes from.

Also even with Trust On First Use, there are actually people who diff between versions by comparing the last release's tarball to the new release's tarball. :) Yes, the number of people who do that is relatively uncommon...

On a similar note, if I see changed hashes for an existing tarball my instinctive reaction is to whip out diffoscope to find out why.

tgamblin Feb 2, 2023

The checksum does tell you one thing -- it tells you that you didn't suffer a network hiccup

TCP already checksums so I don’t see this adding much over the https download, when used as described here.

BillyONeal · 2023-02-01T18:47:15Z

BillyONeal
Feb 1, 2023

Hi there, @vtbassmatt !

What community/ecosystem/project do you represent? If you're in the core committer group for that community and/or part of a security committee, please mention that as well.

I represent vcpkg, https://github.com/microsoft/vcpkg (and https://github.com/microsoft/vcpkg-tool ), which is a source based package management system for C++. We, the Visual C++ team, did NuGet for C++ back in 2012/2013, and that effort did not work: it turns out doing binary deployment for an ecosystem that tends not to care about ABI stability required an impossible cross product of prebuilt bits to know that the resulting package would work. Instead, we built vcpkg which aims to have the same goals (a collection of libraries that can work together) but operates on sources, so people can rebuild their dependencies if necessary.

I, along with @JavierMatosD @vicroms @ras0219-msft @dan-shaw @markle11m and @AugP are core maintainers and implementers.

Does your ecosystem or tool rely on the precise bytes in a source download† from GitHub? If yes, please describe the dependency and what it powers or allows.

We rely on the precise bytes in order to deliver verifiable "no changes means no changes". If you have a vcpkg distribution and install the same bits again 5 years later, we consider it an important guarantee that you do in fact get the same bits you built before. It is somewhat common for folks to change the ref to which a tag is attached, for instance, and we want to be able to detect this.

Moreover, we also have, for continuity-of-business or supply chain audit reasons, an asset caching feature that can optionally sit between the build environment and all external systems like github, which caches all sources which are fetched. The interface of that is user replaceable and effectively looks like:

Optional<The Source Data> try_fetch_from_cache(SHA512)

because different businesses wanted to be able to supply their own caching mechanisms to which we, the vcpkg team, do not have access or experience.

We also prefer to download tarballs rather than git clone, because last time we did this testing (3, 4 years ago now?), git clone was horribly slow: You either did a full clone and paid the bandwidth cost of transferring all history, or a shallow clone where negotiation between client and server to decide which packs to send took forever. However, that seems to have gotten better over time, see below.

Were you affected by the change on January 30th? If so, how did you notice it, what impact did it have, [...]

We were extremely affected. Of approximately 2000 ports in our catalog, approximately 1500 of them became un-installable if one did not have the sources in their asset cache. Our CI system has an asset cache turned on, so that kept working, but many customers immediately reported problems. See microsoft/vcpkg#29288 plus about 14 duplicates filed in that 12 hours despite pinning that issue. Moreover, even if we updated the SHA to the new value, all existing versions of ports available in the system whose source was hosted on GitHub would have effectively become broken forever. (We have fairly vague plans for how we would insert a version in the 'middle' but never expected to need to fix ~1500 at the same time)

We experienced similar breakage before when the "pax_global_header" thing was changed, but that was tens of relatively obscure ports that changed, not the entire catalog.

I spent most of that day trying to author changes to update all 1500 ports before the revert was announced.

[...] has it driven any change in your thinking/policies/tools in the meantime?

This shows that we are going to need to build some sort of feature that allows hashes to be replaced without invalidating versions. We are also likely to consider fast tracking proposals that move source references out of portfile.cmake and into more declarative places that are easier to mechanically audit and update.

I did some experiments and git clone --depth 1 --single-branch no longer seems to suffer from the ruinously slow negotiation times between server and client. It is now only about 5% slower in my testing than downloading the tarball:

I don't know if this negotiation time cost is affected by the number of files or entities in the repos though.

As a result, we may also reconsider our stance to prefer GitHub source archives and prefer commit SHAs instead if we can come up with an effective asset caching strategy.

Anything else you want me to know?

Taking off my vcpkg maintainer hat, speaking personally, I don't 'fault' GitHub for anything that happened here. A change was made to git, you updated git, and this happened. I do have one thing to fault GitHub for though, which is that the UI for the source bundles looks like any other asset on a release page. It is highly surprising that everything in the assets block is "safe" but the last two entries can randomly change on you:

I would really like a statement that, modulo bugs, the formats will be made reproducible, but I understand there are serious technical challenges in delivering that:

it's impractical to store a tarball for every commit or tag (some projects have tens of thousands of tags)
there are many stakeholders building bits on top of each other (GitHub/git/gzip) that make delivering on such reproducibility guarantees hard.

If, as a result, you can't guarantee reproducibility, even for named releases like this, I would really like seeing not reproducible source blocks set apart in some way to demonstrate that these are clearly different in character to other assets.

Thanks to @Neumann-A for letting me know that this thread exists.

2 replies

grafikrobot Feb 2, 2023

If, as a result, you can't guarantee reproducibility, even for named releases like this, I would really like seeing not reproducible source blocks set apart in some way to demonstrate that these are clearly different in character to other assets.

It seems rather straightforward to guarantee reproducibility.. Only use the new archive gen algorithm on new releases/tags. And hence use the old one on existing releases/tags.

BillyONeal Feb 2, 2023

Only use the new archive gen algorithm on new releases/tags.

That's not easy because the archive generation algorithm itself does not guarantee reproducibility. If there were a CVE against gnu gzip they would have to take that patch and if that changes the output even a tiny bit we are screwed.

dvins · 2023-02-01T18:55:51Z

dvins
Feb 1, 2023

As a downstream consumer our build processes across our entire organization were down for upwards of 8h because upstream parts of the Bazel ecosystem rely on them.

Intent aside, we've sent mixed signals to the community over the years. This is not the first time we've rolled back a change that would have altered the hashes. If we say "these aren't stable" but then spend a half decade keeping them stable, that's confusing.

You've effectively hit it on the head -- the hashes must remain stable not withstanding a well communicated long period of notification of future breaking change to be rolled out. And to be quite frank, the reasonable expectation (as they are release/tag artifacts )is that the checksum has always been and should always remain stable. It's supposed to be a stable artifact, why should the hash change.

Granted, the behind the scenes technical reasons are in conflict. And the problem may be that this mechanism is being used in lieu of proper use of some other package publication system. But large ecosystems are in fact relying upon it.

0 replies

giordano · 2023-02-01T18:58:52Z

giordano
Feb 1, 2023

Thanks for opening this discussion, and the communication throughout the incident.

What community/ecosystem/project do you represent? If you're in the core committer group for that community and/or part of a security committee, please mention that as well.

I'm a core developer of BinaryBuilder (repository on GitHub)—building framework mainly used in the Julia ecosystem—,and the main maintainer of Yggdrasil—the largest collection of build recipes for this framework (over one thousands, and about 200 of them using GitHub-generated archives).

Does your ecosystem or tool rely on the precise bytes in a source download† from GitHub? If yes, please describe the dependency and what it powers or allows.

BinaryBuilder strives to have reproducible builds following well-established practices. With a given build recipe, all users using the same version of BinaryBuilder should be able to produce a bit-by-bit identical output tarball for a given target platform. Build recipes specify different types of sources to build packages, including compressed archives (tarballs, zipballs, etc.) or generic files. Archives and generic files have an accompanying SHA256 checksum, to ensure integrity and security of the download. An unstable checksum makes a build recipe simply not work, as we can't verify it the download. Having unstable hashes means that archives generated automatically by GitHub are unreliable, so all recipes we maintain in Yggdrasil won't be able to use them because they could be unreproducible at a later point in time.

Were you affected by the change on January 30th? If so, how did you notice it, what impact did it have, and has it driven any change in your thinking/policies/tools in the meantime?

The immediate impact for the BinaryBuilder ecosystem was limited, we discovered it by following the discussions in other packaging ecosystems. We are now banning the use of /archive/ and archive/refs/tags/ archives generated by GitHub in new recipes. We now recommend users to choose instead a release artifact, when available, or a git repository as source for the build recipe. However this only addresses the problem for future recipes: already existing recipes will not work as they are in the future if the checksums will change (about ~20% of the total in Yggdrasil).

Anything else you want me to know?

As a JuliaLang/julia committer I'd also like to point out that some GitHub archives, with checksums, are referenced in Julia's build system: JuliaLang/julia#48466. Should past checksums become invalid, it won't be possible to compile older versions of Julia fully from source.

I think as a maintainer of a packaging ecosystem I'd like clarity of the stability policy (most information I found about this was in some GitHub issues or on Twitter), and a larger notice to inform downstream project of changes that may affect them and give them the time to take appropriate measures.

0 replies

nektro · 2023-02-01T19:31:10Z

nektro
Feb 1, 2023

As a consumer, I was disappointed (while also very much understanding why) that these package systems experienced the disruption they did. We should all want the latest and greatest compression methods so long as we still get out bytes back at the end. This is why it saddened me to see that many of these systems were using the hash of the tarball rather than the hash of the file contents. I hope this change is the push they need to switch over their hash calculation.

11 replies

tgamblin Feb 1, 2023

Yep. It's not even necessarily safe to allow arbitrary code execution in a sandbox. e.g., that can enable someone to run code inside your firewall and attack internal machines even if they cannot hurt the build host (which isn't a guarantee either).

I have seen projects that do not even use sandboxes considering nar as a result of the community thread from a couple of days ago, so I think it's very important to point out the risks of relying on archive content hashes for security. Hashes like nar were designed to ensure reproducibility -- not security.

nektro Feb 1, 2023

You want to know you have the same tarball with reasonable certainty before you extract it.

My response here was a fundamental disagreement with this claim.

There shouldn't need to be special logic for a tarball vs. any other file type.

A little complexity can go a long way in stability and robustness

GitHub is free to use these new compression methods for any new tarballs. Once a tag is created however its checksum should not change.

These statements contradict themselves if you've read how the tarballs are generated

There is a misconception in these threads that it's safe to rely on content hashes of compressed archives. To perform a content hash, you first have to expand the archive. That means that if there are any CVEs in the decompressor (e.g., tar, gzip), you're vulnerable to malformed input attacks.

Which is precisely why fuzzing is so important and then the bugs get fixed.

Systems like nix mitigate this somewhat with a sandboxed expansion environment, but it still opens up a vector for arbitrary code execution.

The content hash can unzipped and streamed entirely in memory to the hash algorithm before any bytes are written out to disk.

nekopsykose Feb 1, 2023

Which is precisely why fuzzing is so important and then the bugs get fixed.

preventative searching measures do not exhaustively prove a lack of bugs. not doing this (this being nar-hashing-esque things) at all though, does. this is essentially a "just prove the extraction is bug free lol" cop out, and not really viable otherwise.

The content hash can unzipped and streamed entirely in memory to the hash algorithm before any bytes are written out to disk.

whether it is written to disk or not has no relevance to the potential arbitrary execution aspect of the unpacking itself.

quyykk Feb 1, 2023

These statements contradict themselves if you've read how the tarballs are generated

no? It's as simple as having two different git versions and using the correct one to create the archive.

SanjayVas Feb 1, 2023

@nektro

There shouldn't need to be special logic for a tarball vs. any other file type.

A little complexity can go a long way in stability and robustness

You're talking about changing the way that multiple build and packaging systems work rather than fixing a single missing step in the wider ecosystem.

In general, requiring everyone to change what they're doing is a monumental ask. This includes asking people to change build systems.

GitHub is free to use these new compression methods for any new tarballs. Once a tag is created however its checksum should not change.

These statements contradict themselves if you've read how the tarballs are generated

My statement is about what GitHub should do, not what it currently does.

jave27 · 2023-02-01T19:44:52Z

jave27
Feb 1, 2023

As a consumer of Conan.io's Conan Center recipes, I can vouch that this change affected our organization mildly. Conan relies on sha256 checksums manually entered into each recipe for each version of the package being built.

We hit the issue early when attempting to build the boost recipe, which many other recipes depend upon. One of boost's pre-requisites was b2, which was affected by the Jan 30th change, so 3 of us were investigating before discovering it was a more widespread problem. Overall, not much developer time was lost (~6 person hours), but it was a disruption nonetheless. We saw the notice that the change would be reverted, so we pushed out our work by a couple of days.

I've started a conversation in the Conan project about future mitigation plans if something like this were to occur again, but at the moment, it would require a lot of manual work, similar to the vcpkg situation. I completely understand the mechanism and don't fault Github for this event, but the downstream effects were very impactful. Thank you for the quick reversion, it saved a lot more headaches. I'll assist in the future-proofing of Conan where I can, as I can envision similar events in the future.

This just further illustrates how fragile the web ecosystem really is.

0 replies

yann-morin-1998 · 2023-02-01T19:57:15Z

yann-morin-1998
Feb 1, 2023

@vtbassmatt Thanks for asking for feedback, it's greatly appreciated!

What community/ecosystem/project do you represent? If you're in the core committer group for that community and/or part of a security committee, please mention that as well.

I'm a co-maintainer for the Buildroot buildsystem, that helps in building linux-based firmwares for embedded devices (in the spirit of OpenEmbedded/Yocto, OpenWRT...).

Does your ecosystem or tool rely on the precise bytes in a source download† from GitHub? If yes, please describe the dependency and what it powers or allows.

Buildroot does depend on byte-level tarballs that are generated by Github.

We use it to ensure that the archives that get downloaded are pristine, to avoid various attacks (like MITM). Our trust-model (if we can call it that) is based on TOFU: the first time we get a tarball, we assume it is pristine, and hash its content. Later downloads should match that hash; if they don't, it is a security risk (either the new one is maligned, or the original one was).

Were you affected by the change on January 30th? If so, how did you notice it, what impact did it have, and has it driven any change in your thinking/policies/tools in the meantime?

We were affected in two ways.

First, our autobuilder farm started to notice download issues. Our autobuilders continuously run random configurations (amounting to several hundreds to a thousand a day), each building a lot of packages downlaoded from various locations, with quite a substantial part being on Gituhb; we do have a local cache of archives to avoid redownloading the same archives over and over, but that cache is partially pruned (by only just 5 archives at a time!) at the start of each run to check that the upstream is still reachable and that they have not changed. We however did not notice those failures yet as we only get a daily report of the failures. In the meantime...

Second, when reviewing submissions, the maintainers started to notice some discrepancy between the hashes from the submissions, and those that they were getting locally. At around the same time, some users started reporting download failures due to hash mis-match. After a bit of back-n-forth, we pinpointed it to the autogenerated archives from Github.

Thanks you again for asking for our feedback; that is much appreciated! :-)

Regards,
Yann E. MORIN.

0 replies

kentonv · 2023-02-01T20:41:34Z

kentonv
Feb 1, 2023

What community/ecosystem/project do you represent? If you're in the core committer group for that community and/or part of a security committee, please mention that as well.

I am a user of the Bazel build system, but not a developer of Bazel itself. (Specifically I work on workerd.)

Does your ecosystem or tool rely on the precise bytes in a source download† from GitHub? If yes, please describe the dependency and what it powers or allows.

workerd has dependencies on several other projects on GitHub. I'll focus on Cap'n Proto as an interesting case, but this applies to other dependencies as well.

In our Bazel WORKSPACE file, we instruct it to download Cap'n Proto using the auto-generated archive from https://github.com/capnproto/capnproto/tarball/<commit>. Bazel checks the downloaded tarball against a SHA-256 sum to verify it's exactly the one we intended. We feel the hash is important to really prove our builds are reproducible.

Our team also owns and maintains Cap'n Proto itself, and it is very common that we make a change in Cap'n Proto which we immediately want to use in workerd. As a result, it is impractical for us to rely on release tarballs from Cap'n Proto, as this would mean that any time someone needed to make a Cap'n Proto change, they would be saddled with doing a release before they could use the change in workerd. (Cap'n Proto's release process involves doing extended testing on many different platforms that are not relevant to workerd, and it's quite common for some of the exotic platforms to be broken an Cap'n Proto's git head.)

Instead of downloading a tarball, we could instruct Bazel to perform a git clone of the dependency instead. However, in general, this performs much worse, as it downloads history that won't be used, and I imagine it doesn't benefit as much from caching. (In theory shallow clones can be used to avoid downloading more than needed, but they are surprisingly painful to configure and keep updated.) I would imagine that GitHub would actually prefer that CI builds and whatnot download the extremely-cacheable tarballs rather than perform git clones, though if that's not the case, that would be interesting to know!

Were you affected by the change on January 30th? If so, how did you notice it, what impact did it have, and has it driven any change in your thinking/policies/tools in the meantime?

Our CI builds were broken for several hours.

We are considering setting up our own cache for these artifacts which caches them permanently, so once the bytes have been fetched once they won't change. This will take some effort, though, that we'd love to avoid. It's also not clear how community members outside our core team would interact with the cache securely.

Anything else you want me to know?

Googlesource randomizes bytes in their tarballs exactly to prevent people from depending on checksums being stable. As a result, we have been forced to use the git-clone approach for these, but we generally don't like doing so (see above).

It seems to me that GitHub is stuck keeping hashes consistent for all commits that currently exist. There's just to way to get away with that without breaking a huge number of builds. Particularly intractable is the problem of transitive dependencies -- if I depend on foo, and foo depends on bar, and all the hashes change, I am dead in the water until foo updates their build (which they may never do) or I bite the bullet and fork foo (which I really don't want to do).

But it's still possible to make changes that apply only to tarballs of future commits, if the change is rolled out before the relevant date cutoff. Not sure if it's worth it, but it seems like an option.

Ideally Bazel would support specifying a hash of the (canonicalized, somehow) unpacked content rather than a hash of the compressed tarball.

0 replies

tgamblin · 2023-02-02T00:02:00Z

tgamblin
Feb 2, 2023

@vtbassmatt: Thanks for all your work on this, for being so open, and for considering the input of packaging communities!

What community/ecosystem/project do you represent? If you're in the core committer group for that community and/or part of a security committee, please mention that as well.

I'm the original developer and lead for @spack. Spack is AFAIK the most widely used package manager for high performance computing (HPC). It's used broadly in the worldwide HPC community, including U.S. and other national laboratories. It’s known to a lesser extent in the C++, Python, and scientific computing/ML communities. The project has over 1,100 contributors (over its entire lifetime).

I'm not part of a formal security committee, but as part of building out infrastructure for Spack, we have to make a lot of security decisions around CI and open source -- both in cloud CI and in more sensitive environments that consume a lot of open source. We have to consider mirroring to air-gapped environments in addition to environments with access to the internet.

Does your ecosystem or tool rely on the precise bytes in a source download† from GitHub? If yes, please describe the dependency and what it powers or allows.

Spack builds packages from sources, including source archives, binaries, patches, and other artifact downloads. We rely on stable checksums of those downloads for security and reproducibility.

The contribution model in Spack is:

Contributors submit PRs that add to a common repository of package recipes on GitHub (much like homebrew-core, nixpkgs, or vcpkg). Packages contain URLs of source archives and checksums to verify them.
Maintainers review the code in PRs. They may validate checksums of packages, or rely on CI to do it.
Users may maintain and distribute their own recipes, which likewise have URLs and checksums in them.

Our security model is essentially that users must trust the project maintainers. Users should only be expanding archives or building source that maintainers approved. Likewise, if another user gave them a package recipe, they trust the checksums in that recipe because they trust that user.

So, before building a package recipe, we:

Fetch the source code archive
Use checksums to verify the downloaded archive. This ensures:
a. that the source we’re building is the same source that was approved by maintainers; and
b. that we are not vulnerable to malformed input attacks (e.g. if we somehow received a malicious tarball).
Do the same types of checks on patches and other resources.

We also allow contributors to specify source to check out by git commit, but we prefer SHA-256’s on archives for a few reasons:

Git clones are slower than archive downloads.
Precise git commits are SHA-1 hashes. NIST has recommended moving away from SHA-1 and using at least SHA-256. This isn’t an issue right now, as there are no known second preimage attacks on SHA-1, but we want to be future proof. As you probably already know, Git is also trying to develop SHA-256 support.
A tarball or other archive can be mirrored easily, even across an air gap, and it can be validated with the same checksum on the other side. To mirror a git clone across an air gap, we have to create an archive from the clone, and we do not know the hash of the archive directly from the package recipe (we only have the commit hash). This means we can’t validate the archive before we expand it on the other end, which again opens us up to malformed input attacks.

Were you affected by the change on January 30th?

Yes.

If so, how did you notice it?

We noticed the change because users started seeing checksum failures in their downloads from GitHub. We got a number of complaints on GitHub and in Slack about hashes no longer being valid for certain projects.

We maintain a mirror of (nearly) all sources in all packages at https://mirror.spack.io. This is an AWS CloudFront distribution of a big S3 bucket that we update regularly, but not immediately, after contributors add new versions to packages. Thankfully, the mirror mitigated many of the issues and users were, for the most part, not dead in the water due to this change. We currently have a cronjob that updates the mirror once a week or so, so we noticed when users started noticing download errors for packages that were not yet in the mirror.

Spack also caches downloads locally, which would have mitigated the problem further for people rebuilding things (this happens a lot in Spack, e.g., to build with different flags or options).

What impact did it have?

Some contributors began to submit PRs with the new hashes, and we began to think about how we would re-hash all the archives in Spack. There are around 7,500 GitHub /archive/ URLs in Spack right now, so we would’ve needed to download all of those, verify that their content was equivalent to previous archives, and re-calculate their checksums. That could take a very long time, and we don’t have tooling ready to do it at a moment’s notice. Also, many users rely on previous releases of Spack, and we probably would’ve needed to apply the hash changes to those release branches to allow them to continue working — in addition to the tip of develop. Users who rely on fixed Spack commits and couldn’t update to the latest recipes would’ve stayed broken.

The number of packages and archives in Spack grows a bit every year, so this burden will increase a lot over time.

Fortunately, since the change was reverted, we did not have to do the big re-hash, and we hope that we will not have to. We did have to revert the commits from contributors who tried to get on top of this issue early, which is not the greatest incentive for folks who try hard to keep their packages up to date.

We have had to re-hash all of Spack in the past — specifically the last time GitHub did this in 2017 or so. We have been stable since then. There were only 728 tarballs to deal with then, so we’ve grown by more than 10x. You can find the issue where we first discussed this here, along with links to other affected communities:

Github Archive URL checksums have changed spack/spack#5411

Has it driven any change in your thinking/policies/tools in the meantime?

It has made us think about building reusable tooling for doing hash changes, and it has led us to consider what else we could be doing besides relying on GitHub archive URLs.

One option is to use stable release artifacts when they are available. GitHub allows maintainers to upload tarballs to releases, and these are guaranteed to stay the same over time. As others here have mentioned, only a minority of projects actually use that feature. I think that because GitHub makes it so easy to generate a source tarball on the fly, and the dynamic archive links are listed with every software release, most maintainers assume that’s a good way to provide their release artifacts.
Another option would be to use git checkouts with precise commits. As mentioned above, git checkouts are slow and hard to mirror and verify over an air gap. So we try to avoid this option.
We plan to add automation to update our mirror immediately after merges, but we have users who, for whatever reason, don't want to use the mirror. If they unregister it they'll still hit this issue when downloading from the original URLs. Also, the mirror only mitigates the issue for our main package repository. It does not help users who are maintaining their own package recipes.
There are a bunch of people in these discussions advocating for content hashes like the nar utility that nix uses. We will not use this approach because it is not safe to rely on content hashes of compressed archives. To perform a content hash, you first have to expand the archive. That means that if there are any CVEs in the decompressor (e.g., tar, gzip), you're vulnerable to malformed input attacks. Systems like nix mitigate this somewhat with a sandboxed expansion environment, but it still opens up a vector for arbitrary code execution.

We must validate the bits before looking at the contents. I think it’s dangerous that people are advocating for content hashing so strongly, and I have seen projects that do no sandboxing considering nar because of these discussions without understanding the risks.
If hashes are just going to remain unreliable, we could save both the archive checksum and the nar hash (or git tree hash, etc.), then use nar to accelerate the process of changing hashes en masse. It would allow us to quickly verify that new archives have the same contents as old archives without fetching the old archive (or tediously finding/digging up/re-creating it, since the issue is that it's not on GitHub anymore). We would still need the archive hash to validate downloads for security, and using two hashes adds extra complexity for packagers. Package ecosystems really need to minimize barriers like this in order to sustain their contributor base, so this is far from ideal.

So, unfortunately, there are not a lot of good options other than archive URLs. We have to consume what upstream projects provide, and we cannot practically push 7,500 upstreams to do the “right” thing, and we’re not going to compromise on supply chain security, especially since our project is used to build open source in air-gapped environments.

Anything else you want me to know?

Archive URLs are not the only thing we need stable hashes for. Essentially any artifact used in a build should be checksummed. Commit patch URLs are another place where this comes up. The default commit patches generated by GitHub use only minimal commit hashes for the git repo at the time they’re fetched. You have to add ?full_index=1 to the URL to get a full commit hash in the patch, otherwise the number of characters used to write the commit in the patch will grow as the project grows.

As far as I can tell, this is not documented anywhere on GitHub — we found out by looking at Homebrew’s recipes. We adopted Homebrew’s practice of requiring contributors to use full_index=1 in GitHub patch URLs. I am not sure how they knew to do this — they may have inside information b/c their lead maintainer works at GitHub.

See this PR for details:
- Use stable URLs and ?full_index=1 for all github patches spack/spack#29239
It would be nice if there were a guide to tell us what packagers can and can’t rely on, and how to ensure stable hashes where available.
If you cannot guarantee that all archive URLs are stable, we would be mostly happy with just tagged releases. Another idea would be to do something like ?full_index=1, but for archive URLs. Package managers like ours could request the URL with, say, ?stable=1, and you could ensure stability only where it is requested. This might help to reduce and track the number of archives you need to cache, and it would give distributions the flexibility to point to an arbitrary commit archive if needed.
If it isn’t obvious, I think stable hashes are very important for supply chain security and for ensuring that we are providing our users with correct, secure software builds. Given the number of projects that rely on this and the lack of good alternatives, I think GitHub can make a real difference for supply chain security (and for the sanity of package maintainers) by taking action here.

0 replies

Flamefire · 2023-02-02T07:59:38Z

Flamefire
Feb 2, 2023

What community/ecosystem/project do you represent? If you're in the core committer group for that community and/or part of a security committee, please mention that as well.

Core comitter/contributor of EasyBuild and co-maintainer of a national HPC system at Dresden, Germany

Does your ecosystem or tool rely on the precise bytes in a source download† from GitHub? If yes, please describe the dependency and what it powers or allows.

We use EasyBuild at the HPC center to install software. Some are downloaded from Github via the release URLs. The whole download&install process is automated and tested by different people. Only then a "recipe" (for the automation) is included in EasyBuild releases which gets then used to install software on the HPC system using privileged accounts.
The checksums included in the recipe are used to verify that the software downloaded from Github matches what was tested before and results hence in the expected installation. It also provides a layer of security against malicious modifications.
Finally downloaded tarballs are cached locally and reused when required so that a download does only occur if the cached tarball does not exists or has a wrong checksum.
Not having a (stable) checksum would make all of that impossible.

Were you affected by the change on January 30th? If so, how did you notice it, what impact did it have, and has it driven any change in your thinking/policies/tools in the meantime?

Yes: When installing software where a cached tarball is not available (e.g. when using a recipe/software that was never installed on that system before) the verification step after the download from Github failed.

This has happened before also in other distribution systems (CRAN checksums of R packages may change, maintainers of projects on Github reusing tags, re-releases on PyPi using the same version, ...) so we have a way to specify alternative checksums after verifying the contents are still the same. But that is cumbersome to do large-scale and requires access to the "old" version to check for such differences which may be impossible.

An alternative we are considering are "NAR checksums" used by Meson which checksum the extracted contents rather than the archive itself and hence is immune against changes to the compression algorithm used. However this at least introduces additional complexity.

3 replies

eli-schwartz Feb 2, 2023

An alternative we are considering are "NAR checksums" used by Meson

Meson core committer here.

Meson does not use NAR checksums, but NixOS does.

Meson is never going to change to use NAR checksums, although a NixOS person did suggest it to us. We consider NAR checksums to be dangerously insecure unless extensive care is taken to airgap and sandbox the environment used to calculate it. See above discussion between me and @tgamblin.

You're welcome to use it if you feel it solves your problems (or aren't worried about the security ramifications), though. :)

boegel Feb 2, 2023

re-releases on PyPi using the same version

This one is actually not possible: PyPI hard blocks re-releases under the same version (which is a good thing) - cfr. pypa/packaging-problems#74.

But we have indeed seen re-releases in CRAN (see here or here), or release source tarballs in GitHub being changed by the project (see here), or also in other places (see here.
We've even hard cases where a tarball was changed in-place, and then later restored again to the original, causing quite a bit of confusion (see here + here).

boegel Feb 2, 2023

Our discussion regarding whether or not to use a NAR hash (we shouldn't) is here

mosteo · 2023-02-02T08:20:52Z

mosteo
Feb 2, 2023

What community/ecosystem/project do you represent? If you're in the core committer group for that community and/or part of a security committee, please mention that as well.

Lead maintainer/core committer of the Ada package manager Alire.

Does your ecosystem or tool rely on the precise bytes in a source download† from GitHub? If yes, please describe the dependency and what it powers or allows.

No, because we researched the issue when we were deciding how to distribute our tarballs, and found that there was no guarantee on their immutability. However, this would have been our preferred way to distribute tarballs otherwise, and having to upload alternate files to releases introduces some unneeded duplication and complicates the publishing workflow when tarballs are involved (we have alternatives using git commits).

Were you affected by the change on January 30th? If so, how did you notice it, what impact did it have, and has it driven any change in your thinking/policies/tools in the meantime?

Not that we have found, although some contributors may have disregarded our guidelines for providing source archives and used the ones from GitHub anyway. We have no reports about this yet though.

Anything else you want me to know?

In the beginning we assumed these files (in the case of releases) would be generated once and thus inmutable, until we found otherwise. Just so you know what was our initial frame of mind.

0 replies

Wyverald · 2023-02-02T13:06:01Z

Wyverald
Feb 2, 2023

First of all, thank you for the clear and honest communication! It's really refreshing to see :)

What community/ecosystem/project do you represent? If you're in the core committer group for that community and/or part of a security committee, please mention that as well.

I work on Bazel (https://bazel.build/), specifically on external dependencies. I'm on the core Bazel team. (cc other team member @meteorcloudy, major community contributor @fmeum, and security committee member @mihaimaruseac)

Does your ecosystem or tool rely on the precise bytes in a source download† from GitHub? If yes, please describe the dependency and what it powers or allows.

Bazel itself just asks for a URL and a checksum, and so far a project has had to provide URLs and checksums for its dependencies. The community at large has often used the source downloads (instead of user-uploaded archives) as the URLs, and thus formed a direct dependency on the checksum being stable. Furthermore, the new Bazel Central Registry (https://bcr.bazel.build/) stores these URLs and checksums centrally (similar to vcpkg)

Were you affected by the change on January 30th? If so, how did you notice it, what impact did it have, and has it driven any change in your thinking/policies/tools in the meantime?

Yes -- multiple users reported CI breakages. This was especially painful because we had recommended users to treat the refs/tags/... URLs as stable due to the earlier miscommunication you mentioned. We're now thinking about a change in policy to ask users to always upload release archives themselves.

Anything else you want me to know?

Most points have been eloquently explained by other posts in this discussion, so I won't rehash them. I'd just like to be looped into further discussions and announcements. Thanks again!

0 replies

boegel · 2023-02-02T20:51:00Z

boegel
Feb 2, 2023

@vtbassmatt Thank you very much for soliciting feedback from the broader community in an open fashion like this, much appreciated!

What community/ecosystem/project do you represent? If you're in the core committer group for that community and/or part of a security committee, please mention that as well.

I am the lead developer/BDFL of EasyBuild, a tool that facilitates the installation of (scientific) software packages (usually from source), on High-Performance Computing (HPC) systems, a.k.a. supercomputers.
EasyBuild is used by hundreds of HPC sites around the world, and has received contributions from well over 400 unique contributors over the last decade.

We have no formal security committee in EasyBuild, which is large community/volunteer driven, but we try to do what we can, which includes making EasyBuild verify a SHA256 checksum of every source file it downloads/uses before unpacking it.

Does your ecosystem or tool rely on the precise bytes in a source download† from GitHub? If yes, please describe the dependency and what it powers or allows.

EasyBuild maintains a local "source cache", so it only downloads a file once (only when it's not in the cache yet).

We record a SHA256 checksum for every file that EasyBuild downloads/uses, and verify the checksum before unpacking/using that file.

The reason for this is multi-fold: not only as a security measure, but also to prevent accidentally using incorrect source files (a corrupt file due to partial download, a somehow different source tarball that the installation was tested with, etc.).

Were you affected by the change on January 30th? If so, how did you notice it, what impact did it have, and has it driven any change in your thinking/policies/tools in the meantime?

On Jan 30th, several people starting reporting that EasyBuild was producing checksum errors for files it was downloading.
After some initial confusion, we pinned it down to source tarballs being served by GitHub via .../archive/..., and someone eventually pointed out https://github.blog/changelog/2023-01-30-git-archive-checksums-may-change/, which came as a total surprise to us.

This wasn't our first changing-checksum-rodeo - this has sadly become an almost weekly occurrence, albeit for a variety of different reasons: re-releases in CRAN (see here or here), release source tarballs in GitHub (uploaded by project maintainers, so not created on-the-fly by GitHub) being changed by the project (see here), or also in other places (see here).
We've even hard cases where a tarball was changed in-place, and then later restored again to the original, causing quite a bit of confusion (see here + here).

This led us to adding support in EasyBuild for listing not just one but multiple valid checksums - if we know that a source tarball was changed in place but with the exact same (code) contents, we accept both the old and the new one.

Anything else you want me to know?

We have worked through changing source tarballs served by GitHub before, back in Sept'17 for example. Although the effort required to mitigate the "broken" checksums was relatively limited back then, it certainly would be a lot bigger today: while we roughly had about 2,000 unique software "titles" (not including different versions) supported in Sept'17, we have about triple that today, and I'm confident that we're downloading a lot more stuff from GitHub today than we were 5 years ago (I don't have the numbers readily available on that, but I can figure that out if it would be useful).

Moreover, it would render existing EasyBuild releases (which include the SHA256 checksums for all source files EasyBuild knows about) largely useless, since they would be mostly be a Russian roulette tool w.r.t. verifying checksums, and we would need to scramble to:

i) find all broken checksums, check the newly served source tarballs for source code differences, and fix the checksums;
ii) develop some tooling to facilitate this operation along the way, knowing that it wouldn't be the last time we would have to deal with this;
iii) push out a new EasyBuild release that includes the fixed checksums (likely a couple of 1000);
iv) somehow try to convince everyone to update to the latest EasyBuild release (the nature of open source implies that we have no way of knowing who actually uses EasyBuild, so it's effectively impossible to reach all our users, let alone convince them they should update);
v) deal with all incoming questions/complaints that are triggered because EasyBuild suddenly reports incorrect checksums for person X while it doesn't for person Y (because of cached downloads, for example) - a situation that would probably go on for days or even weeks;

I can guarantee that this whole ordeal on Jan 30th 2023 caused a lot of confusion all around the world, and will have an impact for days/weeks/months to come - people will have downloaded tarballs during those couple of hours that the change was in effect, use that to compute a SHA256 checksum, include that now incorrect checksum in a contribution to EasyBuild (we get about 2,500 contributions of this type each year...), others would report getting a different checksum, leading to another wild goose chase to figure out how on earth that's possible (cosmic rays maybe?!), etc.

TLDR: Although it's clear that the impact that this situation has had was unintended, it has most definitely made a lot of package managers cry.

0 replies

aaronmondal · 2023-02-03T00:55:56Z

aaronmondal
Feb 3, 2023

Thanks a lot for this discussion. It's really cool to see that GitHub cares about its users, including those with not-exactly-mainstream requirements.

What community/ecosystem/project do you represent? If you're in the core committer group for that community and/or part of a security committee, please mention that as well.

I'm the creator of rules_ll, an upstream Clang/LLVM based toolchain for C++ and heterogeneous programming.

Does your ecosystem or tool rely on the precise bytes in a source download† from GitHub? If yes, please describe the dependency and what it powers or allows.

We use Bazel, so in practice things broke for similar reasons others have already mentioned. Even without Bazel I think things wouldve broken for us though.

We download the llvm-project at specific commits, overlay custom build files, build it and then make that toolchain available to end users. We also pull in various smaller dependencies, sometimes at release tags, often at intermediary commits that contain e.g. bug fixes for compatibility with upstream Clang. Such smaller repos are e.g GitHub-hosted tooling from Nvidia and AMD. So we have several dependencies where we manually, explicitly track commits of git repos and the hashes of their tar archives.

We do this to propagate features with minimal delay to their original commit time, which can be months before "stable" releases and there is no way for us to depend on such stable releases in any way.

Building an entire C++ toolchain as part of a build is also quite the CPU cycle consuming task. To work around this we depend on caching that becomes more effective with stronger reproducibility guarantees.

Were you affected by the change on January 30th? If so, how did you notice it, what impact did it have, and has it driven any change in your thinking/policies/tools in the meantime?

Yes. Our infrastructure was down. We noticed this because we flushed our cache. Our project was unusable and having to change hashes blocked all other development. Initially we were worried about a security breach, but thanks to your quick response in the original discussion we quickly noticed that things were fine.

However, we were also quite unsure how to handle updating these hashes. Adjusting them is not too much work, but also not completely negligible. Without knowing how frequently the hashes would change, we were unsure whether we needed just a quick patchup or a completely different way to handle this part of our project.

The incident caused us to consider building separate fallback mirroring infrastructure.

Maybe unintuitively, this has caused us to increase our efforts in relying on hashes, pinned versions, encapsulation etc. We were actually very happy that we were able to notice the change in the supply chain at all. Now we want this awareness for the few remaining parts we don't have control over yet.

In my case this means that I'm now working on encapsulating our project in Nix environments so that we have a reproducible host toolchain to build our actual bazel toolchain. Toolchainception!

Anything else you want me to know?

Similar to some others in this discussion, it doesn't actually matter to us if hashes change. If we know they'll change frequently we can build infrastructure to support this. If it's something like an update of git that causes git archive to behave differently, we can just build the latest git as part of a CI job that tests if the archive functionality changed or something like that.

We don't rely on hashes because we never want inputs to change, but because we need to know when they do change. Holding back updates like potentially more efficient compression is the last thing we want. We use these kinds of tools because they enable frequent, potentially agressive change and provide stability in the form of "rollbackability", not in the form of "we never change anything".

4 replies

ns-ggeorgiev Feb 7, 2023

I am bazel consumer (I spent whole day fixing hashes, and the next day again restoring the hashes)

There is an option 3: You can still have all archive stable, and you can have ability to change things. I can think of two options, there. are probably more:

Encode the version of the algorithm used to compress in the url. This way you can change the algo and maintain backward compatibility for places where old urls were used. You can deprecate old URL say after an year or so.
Determine the algo based on how old the release is. Say releases before 3/3/2023 use the old algo, after the new algo

aaronmondal Feb 8, 2023

This does sound like something that could work, at least in my use case. URLs may become a bit unwieldy if we had <url...>/<config_hash_of_current_compression_config>/<...> though. Maybe something like <url...>/<gh_compression_number>/<...>, where the middle part is just 1, 2, 3, etc?

I'm no expert regarding URL design, but wasn't there some important reason why an API shouldn't be designed that way?

ns-ggeorgiev Feb 8, 2023

I never heard about such concern, actually, vise versa, most APIs are encouraged to have v[n] in them from day one, for exactly that reason

aaronmondal Feb 8, 2023

Cool, then I think this approach would be a viable indeed.

It may even be possible to have depandabot crawl for such links and help propose updates (not to the git commit hash but to the compression scheme version) 👍

dg0yt · 2023-02-08T07:41:11Z

dg0yt
Feb 8, 2023

What community/ecosystem/project do you represent? If you're in the core committer group for that community and/or part of a security committee, please mention that as well.

This is the perspective of a small open source project.
We offer an open source software package on GH, and for binary releases of this package we do offer another open source "superbuild" project to build dependencies, based on CMake's ExternalProject. Some dependencies are hosted on GitHub.
We build binaries for Windows, macOS and Android with Azure Pipelines (AZP).
We build packages for some Linux distros with openSUSE Build Service (OBS).

Does your ecosystem or tool rely on the precise bytes in a source download† from GitHub? If yes, please describe the dependency and what it powers or allows.

The superbuild project relies on precise bytes from the packages' source tarballs.
It is convenient to feed URL and URL_HASH parameters into CMake's ExternalProject.

The OBS builds occassionally include packaging of (particular versions of) dependencies.
Again, there are convenient tools to fetch a tarball and verify its checksum.

Were you affected by the change on January 30th? If so, how did you notice it, what impact did it have, and has it driven any change in your thinking/policies/tools in the meantime?

The AZP and OBS builds weren't directly affected because we didn't have rebuilds at this time.
My local builds weren't affected because I'm carefully keeping local copies of any open source tarball that I use to create released binary artifacts.

Anything else you want me to know?

Looking at GH releases from the perspective of a software author, it seems fair to me to assume that the tarballs associated with a particular "Release" are handled as precise bytes similar to the assets uploaded to that "Release", not as a on-the-fly service. It seems redundant to upload a source tarball as an asset when GH has all information to create this asset, and even appears to provide it as an asset already.

In addition, my GH Release workflow makes use of "Drafts". Basically, I want the precise source tarball bits to be frozen when I create a draft release. Then I can create the binary assets from this specific tarball (by updating superbuild and OBS recipes), and publish the "Release" when the assets are validated and uploaded.

0 replies

jcar87 · 2023-02-14T09:41:46Z

jcar87
Feb 14, 2023

What community/ecosystem/project do you represent? If you're in the core committer group for that community and/or part of a security committee, please mention that as well.

I represent Conan Center (https://github.com/conan-io/conan-center-index). Conan Center provides C and C++ packages to be used with the Conan package manager. Users have an option to download pre-built binaries provided by us (we currently support a high number of operating systems, architectures and compilers), or alternatively to build them from source from a build recipe we provide.

Does your ecosystem or tool rely on the precise bytes in a source download† from GitHub? If yes, please describe the dependency and what it powers or allows.

Whenever our packages are built from source, either by our own CI service or by users, we compare the checksum of the downloaded sources against what was expected for the particular version of a library or package. Not only is traceability important for us and our users (reproducible builds), the complexities of C++ mean that even seemingly innocuous changes can alter the observed behavior either at compile time or at runtime. We are also aware that this is important for our enterprise users, in particular those that work in industries with stringent traceability requirements, such as aerospace, automotive and medical.

Were you affected by the change on January 30th? If so, how did you notice it, what impact did it have, and has it driven any change in your thinking/policies/tools in the meantime?

We were affected by this change, but the impact for our users was limited and indeed mitigated by GitHub’s decision to revert the change. A great number of our users will download binaries provided by us, and as such, do not build libraries from source and were not impacted by this change. For example, we provide 104 compiled binaries for the popular Protobuf library, covering a number of compiler, architectures and operating system combinations.
However, this was still noticed and reported by some our users, typically in some of the following scenarios, and indeed caused noticeable disruption:

Users that need binaries for less-popular platforms, or platforms not currently covered by our binary service
Enterprise users that have policies that prevent them from using externally-provided compiled binaries, who may still be using the build recipes from Conan Center to build from source.
Our own CI service for open pull requests where recipes are updated and binaries need to be rebuilt from source.

We already had a policy of advising recipe authors and contributors to prefer to use published artifacts from releases, as those are guaranteed to be hash-stable, rather than anything from /archive. However, and as mentioned elsewhere in this thread, GitHub’s archives have been hash-stable for so long that this created an impression that they would continue to be so. On the other hand, not all the libraries we build provide release artifacts (see below for a further explanation).

Anything else you want me to know?

While I believe we can greatly mitigate this by advising our contributors and recipe authors to choose hash-stable files for library releases, this is not always possible. An increasing number of library authors are no longer doing formal, versioned releases and instead encouraging users to “live at head”. Sometimes there isn't even a tag we can refer to - and all we have is a git commit hash. The lack of a formal version number is not a problem per se - the particular git revision can still be unequivocally and uniquely identified by the git hash of the upstream repository. However, the lack of a hash-stable downloadable archive can be a problem. Using git itself to download source code has proven problematic in the past. It tends to take longer and requires special care to avoid cloning the entire history of a repository when only one commit is needed. If memory serves me well, performing a shallow clone from arbitrary commit hash (rather than a branch or a tag, as some repositories don't provide releases or tags) was not possible until recent versions of the git client, and requires specific capabilities on the git server. Another advantage of using a single file with a hash is that it makes it easy to host mirrors (either official, or local to the organisation).

0 replies

grafikrobot · 2023-02-16T00:39:16Z

grafikrobot
Feb 16, 2023

1. What community/ecosystem/project do you represent? If you're in the core committer group for that community and/or part of a security committee, please mention that as well.

That's a complicated question.. I represent myself (as an OSS developer), and BoostOrg, and BFGroup, and Bincrafters, and CPPAlliance, and Conan (as a package contributor). Yes I wear many hats.

2. Does your ecosystem or tool rely on the precise bytes in a source download† from GitHub? If yes, please describe the dependency and what it powers or allows.

Yes, although @jcar87 already commented on Conan (https://github.com/orgs/community/discussions/46034#discussioncomment-4968810). But I also maintain an alternate Conan server implementation (https://barbarian.bfgroup.xyz/) that suffers from the same issues of recipes that use the archive hashes.

3. Were you affected by the change on January 30th? If so, how did you notice it, what impact did it have, and has it driven any change in your thinking/policies/tools in the meantime?

Not immediately as by the time I was aware of it and started to think about what I needed to do for the various Conan recipes I maintain the change had been reverted.

4. Anything else you want me to know?

Yes. I've been thinking about this since this post. In particular regarding @BillyONeal reply (https://github.com/orgs/community/discussions/46034#discussioncomment-4843932) about the safe vs. doom sections of the release assets list. And I had one idea that would help me, as a producer of OSS software and tools..

It would be fantastic if I could have a button/link/action to "bake" the "Repository code download archives" into "Repository release archives". And even better, having a setting that made that baking automatic for labeled releases and/or an option when creating a release that baked those archives. In other words, if relying on the code archives is the problem. The solution is to make it easy to not rely on them.

0 replies

tpudlik · 2024-09-06T17:58:26Z

tpudlik
Sep 6, 2024

@vtbassmatt I just hit a SHA256 change on an archive: according to the GitHub UI, https://github.com/STMicroelectronics/cmsis-core/archive/refs/tags/v5.4.0_cm4.tar.gz has not been updated since 2019, but the SHA256 changed yesterday from f711074a546bce04426c35e681446d69bc177435cd8f2f1395a52db64f52d100 to 32f226c31d7d1ff4a504404400603e047b99f405cd0c9a8f417f1f250251b829. Is this expected?

10 replies

eli-schwartz Sep 6, 2024

Do you have the old version of the archive? This would allow comparing the two via e.g. diffoscope.

tpudlik Sep 6, 2024

Sadly no. I only have the uncompressed contents of the old archive, and I verified they are the same as the contents of the "new" archive (via diff -qr).

tpudlik Sep 6, 2024

I figured it out. It's STM's fault. Sorry about the noise here.

STM apparently renamed their repository, from https://github.com/STMicroelectronics/cmsis_core/ to https://github.com/STMicroelectronics/cmsis-core/. The links all redirect, but the top-level folder in the compressed archive now has a different name (cmsis-core, not cmsis_core), so the checksum changed, too.

Answer selected by vtbassmatt

vtbassmatt Sep 6, 2024
Author

Glad you got to the bottom of it. I’m no longer with GitHub and was searching my records to figure out who to redirect you to. I’ll call off that particular search for now 😋

tpudlik Sep 6, 2024

I do think there's a sharp edge here---I don't know if STM were aware that renaming the repo will invalidate the archive checksums. Is there a good place to file an issue about this against GitHub?

vtbassmatt Sep 6, 2024
Author

@roferg I assume this is probably somewhere in your domain. Perhaps an explicit callout in docs would at least give people help troubleshooting when they encounter this?

eli-schwartz Sep 6, 2024

STM apparently renamed their repository

Yup, that turned out to be the real reason for pretty much all hash changes in the past ~decade other than the issue that spawned this Discussions ticket. :)

It's not obvious how to make that not an issue, since downloading an archive of a repository is "supposed" to have a root directory consisting of ${repo_name}-${git_ref}/ and new visitors to the repository would tend to expect that it would be the same as the current repository name.

GitHub archive hash stability #46034

Replies: 20 comments · 33 replies

vtbassmatt Sep 6, 2024 Author

vtbassmatt Sep 6, 2024 Author

Replies: 20 comments 33 replies

vtbassmatt Sep 6, 2024
Author

vtbassmatt Sep 6, 2024
Author