
feat(target-gh): Check asset MD5 hash by downloading the asset #328

Merged · 6 commits from iker/fix/gh-checksum into master · Nov 18, 2021

Conversation

@iker-barriocanal (Contributor) commented Nov 16, 2021

The GitHub API doesn't provide an endpoint to get the hash of an asset.
GitHub's response contains an ETag header with the MD5 hash of the asset, but
in certain cases the ETag header is not present. When that happens we don't
have the hash of the asset, and thus cannot verify the asset's correctness.
To verify it, we download the file and calculate the hash locally.

Files are downloaded into memory, and their size is not checked. This carries
the risk of downloading files too large to hold in memory, which is something
we accept at the moment.

The GitHub SDK throws exceptions when server responses are HTTP errors. These
calls are wrapped so that more detailed errors are thrown.

The hash check was originally introduced in #308;
this is a follow-up to the initial fix in #323.
Releases were failing before this fix (see getsentry/publish#635).

Closes #322.
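
For readers following along, here is a minimal sketch of the verification flow described above. It assumes a plain HTTP client (Node's global fetch) rather than the actual Octokit-based code in src/targets/github.ts, and verifyAssetChecksum is an illustrative name:

```ts
import { createHash } from 'crypto';

// Minimal sketch: prefer the MD5 from the ETag header; if the header is
// missing, fall back to downloading the asset into memory and hashing it.
async function verifyAssetChecksum(
  assetUrl: string,
  localChecksum: string
): Promise<boolean> {
  const head = await fetch(assetUrl, { method: 'HEAD' });
  // S3-style ETags for non-multipart uploads are the quoted MD5 of the body.
  const etag = head.headers.get('etag')?.replace(/(^W\/)|"/g, '');
  if (etag) {
    return etag === localChecksum;
  }
  // No ETag: download the whole asset (in memory!) and hash it locally.
  const response = await fetch(assetUrl);
  const body = Buffer.from(await response.arrayBuffer());
  const remoteChecksum = createHash('md5').update(body).digest('hex');
  return remoteChecksum === localChecksum;
}
```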

src/targets/github.ts (outdated review thread, resolved)
@chadwhitacre (Member) left a comment

Looks good overall; a couple of questions/suggestions.

@iker-barriocanal (Contributor, Author) commented Nov 17, 2021

Let's summarize the conversation so far:

  • A small helper function to guarantee that remoteChecksum and localChecksum remain identical (sketched below). Done in e016782.
  • Preserve the note about this local computation only working for simple cases. Done in 477308a.
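
For illustration, such a helper could look roughly like this; ensureChecksumsMatch is a hypothetical name, and the actual implementation in e016782 may differ:

```ts
// Hypothetical sketch; not the exact helper from e016782.
function ensureChecksumsMatch(
  remoteChecksum: string,
  localChecksum: string
): void {
  if (remoteChecksum !== localChecksum) {
    throw new Error(
      `Checksum mismatch: remote is "${remoteChecksum}", local is "${localChecksum}"`
    );
  }
}
```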

We also identified a bug in 469925d, where the spinner wasn't being completed. This has been fixed in 5083340.

@rhcarvalho (Contributor) left a comment

LGTM, but since I'm biased (@iker-barriocanal and I worked together on this), let's wait for @chadwhitacre to chime in.

@chadwhitacre (Member) left a comment

Looks great! :shipit:

@iker-barriocanal iker-barriocanal merged commit d14a988 into master Nov 18, 2021
@iker-barriocanal iker-barriocanal deleted the iker/fix/gh-checksum branch November 18, 2021 10:07
src/targets/github.ts (review thread, resolved)
```ts
      Accept: DEFAULT_CONTENT_TYPE,
    },
  });
} catch (e) {
```
@BYK (Member) commented:

I'm not sure if catching the original exception and throwing a new one is actually good here as you obscure the original issue. In debug mode, we already log the response status etc.

A Contributor replied:

Thanks @BYK !!

In this case, we were seeing something like "Failed: Not Found" mixed into the logs. The RequestError (the e) from Octokit didn't provide much context, and not knowing clearly whether the problem was in the HEAD request (the original ETag method) or the GET request (the new download method added in this PR) made debugging difficult.

The original error is still printed along with a custom message. Do you think we could do better?
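
For reference, the pattern under discussion looks roughly like this; octokit, owner, repo, and assetId are assumed to be in scope, and the message text is illustrative rather than the PR's exact wording:

```ts
let response;
try {
  response = await octokit.request(
    'GET /repos/{owner}/{repo}/releases/assets/{asset_id}',
    {
      owner,
      repo,
      asset_id: assetId,
      headers: { accept: 'application/octet-stream' },
    }
  );
} catch (e) {
  // Rethrow with context: the message identifies which request failed (the
  // GET download rather than the earlier HEAD/ETag lookup) while still
  // including the original Octokit RequestError.
  throw new Error(`Cannot download asset ${assetId} via GET: ${e}`);
}
```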

```ts
// XXX: This is a bit hacky as we rely on various things:
// 1. GitHub issuing a redirect to AWS S3.
// 2. S3 using the MD5 hash of the file for its ETag cache header.
// 3. The file being small enough to fit in memory.
```
@BYK (Member) commented:

Calculating an MD5 hash from streaming data is quite trivial in Node, so I'd encourage you to revisit this approach in a follow-up PR, as buffering causes unnecessary memory usage. We may have rather large files.
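
A sketch of that streaming approach, using only Node's standard https and crypto modules (illustrative only: https.get does not follow the GitHub-to-S3 redirect by itself, so real code would need to handle redirects):

```ts
import { createHash } from 'crypto';
import { get } from 'https';

// Hash the response body chunk by chunk instead of buffering it in memory.
function md5OfUrl(url: string): Promise<string> {
  return new Promise((resolve, reject) => {
    get(url, res => {
      const hash = createHash('md5');
      res.on('data', chunk => hash.update(chunk));
      res.on('end', () => resolve(hash.digest('hex')));
      res.on('error', reject);
    }).on('error', reject);
  });
}
```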

@rhcarvalho (Contributor) replied:

Yes, @iker-barriocanal and I talked extensively about this.

Trade-offs:

  • Octokit's request method seems to buffer the full response payload, so if we wanted to work with streams, we'd need to depart from using request. I believe we're originally using Octokit in this code path to benefit from the different rate limits given to authenticated requests -- @BYK, can you confirm that?
  • We don't work with streams when doing the upload either. We read (with readSync!) full asset files from disk into memory before uploading, and we use the same bytes in memory to calculate the "local checksum" (see the sketch after this list). So the assumption that files are small enough was implicitly already there.
  • If memory usage becomes a concern, then we certainly need to revisit this, but probably not only the fallback download step.
  • We've filed issues here on GitHub and internal tracking tickets for future improvements. Perhaps none specifically about streaming -- now done in "GitHub target: reduce memory consumption by streaming uploads/downloads" #329.
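
A rough reconstruction of that upload path based on the description above (not the verbatim craft code; octokit, owner, repo, and releaseId are assumed to be in scope):

```ts
import { readFileSync } from 'fs';
import { createHash } from 'crypto';

async function uploadAsset(
  assetPath: string,
  assetName: string
): Promise<string> {
  // The entire file is buffered in memory...
  const data = readFileSync(assetPath);
  // ...and the same bytes are hashed to produce the "local checksum".
  const localChecksum = createHash('md5').update(data).digest('hex');
  await octokit.rest.repos.uploadReleaseAsset({
    owner,
    repo,
    release_id: releaseId,
    name: assetName,
    // Octokit's types declare `data` as a string, but binary uploads pass
    // a Buffer in practice.
    data: data as unknown as string,
  });
  return localChecksum;
}
```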

@BYK (Member) replied:

> Octokit's request method seems to buffer the full response payload, so if we wanted to work with streams we'd need to depart from using request. I believe we're originally using Octokit in this code path to benefit from the different rate limits given to authenticated requests

Octokit takes care of a bunch of things, from authentication to rate limiting to parallelization to automatic retries. So I'd say stick with it as long as you can. If solving this issue requires a departure from Octokit, delay that as long as possible, and file an issue with them ahead of time so they might be able to bring this into the library itself.

> We don't work with streams when doing the upload. We read (with readSync!) full asset files from disk into memory before uploading. We use the same bytes in memory to calculate the "local checksum". So the assumption that files are small enough was implicitly already there.

Great catch. This is again due to Octokit limitations, though, as it does not support streams or piping, only buffers.

