
feat(target-gh): Check asset MD5 hash by downloading the asset #328

Merged · 6 commits from iker/fix/gh-checksum into master · Nov 18, 2021

Conversation

@iker-barriocanal (Contributor) commented Nov 16, 2021

The GitHub API doesn't provide an endpoint to get the hash of an asset.
GitHub's response contains an ETag header with the MD5 hash of the asset, but
in certain cases the ETag header is not present. When that happens we don't
have the hash of the asset, and thus cannot verify the asset's correctness.
To verify it, we download the file and calculate the hash locally.

Files are downloaded into memory, and their size is not checked. This carries
the risk of downloading files too large to hold in memory, which is something
we accept at the moment.

The GitHub SDK throws exceptions when server responses are HTTP errors. These
calls are wrapped so that more detailed errors are thrown.

The hash check was originally introduced in #308;
this is a follow-up to the initial fix in #323.
Releases were failing before this fix (see getsentry/publish#635).

Closes #322.
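
For readers following along, here is a minimal sketch of the verification flow described above. It assumes a plain HTTP client (Node's global fetch) rather than the actual Octokit-based code in src/targets/github.ts, and verifyAssetChecksum is an illustrative name:

```ts
import { createHash } from 'crypto';

// Minimal sketch: prefer the MD5 from the ETag header; if the header is
// missing, fall back to downloading the asset into memory and hashing it.
async function verifyAssetChecksum(
  assetUrl: string,
  localChecksum: string
): Promise<boolean> {
  const head = await fetch(assetUrl, { method: 'HEAD' });
  // S3-style ETags for non-multipart uploads are the quoted MD5 of the body.
  const etag = head.headers.get('etag')?.replace(/(^W\/)|"/g, '');
  if (etag) {
    return etag === localChecksum;
  }
  // No ETag: download the whole asset (in memory!) and hash it locally.
  const response = await fetch(assetUrl);
  const body = Buffer.from(await response.arrayBuffer());
  const remoteChecksum = createHash('md5').update(body).digest('hex');
  return remoteChecksum === localChecksum;
}
```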

src/targets/github.ts (outdated review thread, resolved)
@chadwhitacre (Member) left a comment

Looks good overall; a couple of questions/suggestions.

@iker-barriocanal (Contributor, Author) commented Nov 17, 2021

Let's summarize the conversation so far:

  • A small helper function to guarantee that remoteChecksum and localChecksum remain identical (sketched below). Done in e016782.
  • Preserve the note about this local computation only working for simple cases. Done in 477308a.
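
For illustration, such a helper could look roughly like this; ensureChecksumsMatch is a hypothetical name, and the actual implementation in e016782 may differ:

```ts
// Hypothetical sketch; not the exact helper from e016782.
function ensureChecksumsMatch(
  remoteChecksum: string,
  localChecksum: string
): void {
  if (remoteChecksum !== localChecksum) {
    throw new Error(
      `Checksum mismatch: remote is "${remoteChecksum}", local is "${localChecksum}"`
    );
  }
}
```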

We also identified a bug in 469925d, where the spinner wasn't being completed. This has been fixed in 5083340.

@rhcarvalho (Contributor) left a comment

LGTM, but since I'm biased (@iker-barriocanal and I worked together on this), let's wait for @chadwhitacre to chime in.

@chadwhitacre (Member) left a comment

Looks great! :shipit:

@iker-barriocanal iker-barriocanal merged commit d14a988 into master Nov 18, 2021
@iker-barriocanal iker-barriocanal deleted the iker/fix/gh-checksum branch November 18, 2021 10:07
src/targets/github.ts (review thread, resolved)
```ts
      Accept: DEFAULT_CONTENT_TYPE,
    },
  });
} catch (e) {
```
@BYK (Member) commented:

I'm not sure if catching the original exception and throwing a new one is actually good here as you obscure the original issue. In debug mode, we already log the response status etc.

A Contributor replied:

Thanks @BYK !!

In this case, we were seeing something like "Failed: Not Found" mixed into the logs. The RequestError (the e) from Octokit didn't provide much context, and not knowing clearly whether the problem was in the HEAD request (the original ETag method) or the GET request (the new download method added in this PR) made debugging difficult.

The original error is still printed along with a custom message. Do you think we could do better?
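
For reference, the pattern under discussion looks roughly like this; octokit, owner, repo, and assetId are assumed to be in scope, and the message text is illustrative rather than the PR's exact wording:

```ts
let response;
try {
  response = await octokit.request(
    'GET /repos/{owner}/{repo}/releases/assets/{asset_id}',
    {
      owner,
      repo,
      asset_id: assetId,
      headers: { accept: 'application/octet-stream' },
    }
  );
} catch (e) {
  // Rethrow with context: the message identifies which request failed (the
  // GET download rather than the earlier HEAD/ETag lookup) while still
  // including the original Octokit RequestError.
  throw new Error(`Cannot download asset ${assetId} via GET: ${e}`);
}
```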

```ts
// XXX: This is a bit hacky as we rely on various things:
// 1. GitHub issuing a redirect to AWS S3.
// 2. S3 using the MD5 hash of the file for its ETag cache header.
// 3. The file being small enough to fit in memory.
```
@BYK (Member) commented:

Calculating an MD5 hash from streaming data is quite trivial in Node, so I'd encourage you to revisit this approach in a follow-up PR, as buffering causes unnecessary memory usage. We may have rather large files.
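
A sketch of that streaming approach, using only Node's standard https and crypto modules (illustrative only: https.get does not follow the GitHub-to-S3 redirect by itself, so real code would need to handle redirects):

```ts
import { createHash } from 'crypto';
import { get } from 'https';

// Hash the response body chunk by chunk instead of buffering it in memory.
function md5OfUrl(url: string): Promise<string> {
  return new Promise((resolve, reject) => {
    get(url, res => {
      const hash = createHash('md5');
      res.on('data', chunk => hash.update(chunk));
      res.on('end', () => resolve(hash.digest('hex')));
      res.on('error', reject);
    }).on('error', reject);
  });
}
```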

@rhcarvalho (Contributor) replied:

Yes, @iker-barriocanal and I talked extensively about this.

Trade-offs:

  • Octokit's request method seems to buffer the full response payload, so if we wanted to work with streams, we'd need to depart from using request. I believe we're originally using Octokit in this code path to benefit from the different rate limits given to authenticated requests -- @BYK, can you confirm that?
  • We don't work with streams when doing the upload either. We read (with readSync!) full asset files from disk into memory before uploading, and we use the same bytes in memory to calculate the "local checksum" (see the sketch after this list). So the assumption that files are small enough was implicitly already there.
  • If memory usage becomes a concern, then we certainly need to revisit this, but probably not only the fallback download step.
  • We've filed issues here on GitHub and internal tracking tickets for future improvements. Perhaps none specifically about streaming -- now done in "GitHub target: reduce memory consumption by streaming uploads/downloads" #329.
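
A rough reconstruction of that upload path based on the description above (not the verbatim craft code; octokit, owner, repo, and releaseId are assumed to be in scope):

```ts
import { readFileSync } from 'fs';
import { createHash } from 'crypto';

async function uploadAsset(
  assetPath: string,
  assetName: string
): Promise<string> {
  // The entire file is buffered in memory...
  const data = readFileSync(assetPath);
  // ...and the same bytes are hashed to produce the "local checksum".
  const localChecksum = createHash('md5').update(data).digest('hex');
  await octokit.rest.repos.uploadReleaseAsset({
    owner,
    repo,
    release_id: releaseId,
    name: assetName,
    // Octokit's types declare `data` as a string, but binary uploads pass
    // a Buffer in practice.
    data: data as unknown as string,
  });
  return localChecksum;
}
```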

@BYK (Member) replied:

> Octokit's request method seems to buffer the full response payload, so if we wanted to work with streams we'd need to depart from using request. I believe we're originally using Octokit in this code path to benefit from the different rate limits given to authenticated requests

Octokit takes care of a bunch of things, from authentication to rate limiting to parallelization to automatic retries. So I'd say stick with it as long as you can. If solving this issue requires a departure from Octokit, delay that as long as possible, and file an issue with them ahead of time so they might be able to bring this into the library itself.

> We don't work with streams when doing the upload. We read (with readSync!) full asset files from disk into memory before uploading. We use the same bytes in memory to calculate the "local checksum". So the assumption that files are small enough was implicitly already there.

Great catch. This is again due to Octokit limitations, though, as it does not support streams or piping, only buffers.

