feat(target-gh): Check asset MD5 hash by downloading the asset #328
Conversation
The GitHub API doesn't provide any endpoints to get the hash of an asset. GitHub's response contains an ETag header with the MD5 hash of the asset. In certain cases, the ETag header is not present. This means we don't have the hash of the asset, and thus, we cannot verify the asset's correctness. To verify it, we download the file and calculate the hash locally.

Files are downloaded into memory, and their size is not checked. This carries the risk of downloading overly large files, which is something we accept at the moment.

The GitHub SDK throws exceptions when server responses are HTTP errors. These interactions are wrapped and more detailed errors are thrown.

Originally, the hash check was introduced in #308. This is a follow-up to the initial fix: #323. Releases were failing before this fix, see getsentry/publish#635.

Closes #322.
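For illustration only, here is a minimal TypeScript sketch of the fallback described above. The helper name and the exact Octokit call are assumptions, not the PR's actual code:

```typescript
import { createHash } from "crypto";
import { Octokit } from "@octokit/rest";

// Hypothetical helper: download a release asset into memory and compute
// its MD5 locally, for when the ETag header doesn't carry the hash.
async function getRemoteAssetMd5(
  octokit: Octokit,
  owner: string,
  repo: string,
  assetId: number
): Promise<string> {
  // Asking for the raw content type makes GitHub redirect to its storage
  // backend and return the file bytes instead of the JSON metadata.
  const response = await octokit.request(
    "GET /repos/{owner}/{repo}/releases/assets/{asset_id}",
    {
      owner,
      repo,
      asset_id: assetId,
      headers: { accept: "application/octet-stream" },
    }
  );
  // The whole asset is buffered in memory; its size is not checked.
  return createHash("md5")
    .update(Buffer.from(response.data as ArrayBuffer))
    .digest("hex");
}
```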
Looks good overall, a couple questions/suggestions.
Let's summarize the conversation so far:
We also identified a bug in 469925d, where the spinner wasn't being completed. This has been fixed in 5083340.
LGTM, but since I'm biased as @iker-barriocanal and I worked together on this, let's wait for @chadwhitacre to chime in.
Looks great!
        Accept: DEFAULT_CONTENT_TYPE,
      },
    });
  } catch (e) {
I'm not sure if catching the original exception and throwing a new one is actually good here as you obscure the original issue. In debug mode, we already log the response status etc.
Thanks @BYK!!

In this case here, we were seeing something like "Failed: Not Found" mixed up in the logs. So the `RequestError` (the `e`) from Octokit didn't provide much context, and not knowing clearly whether the problem was in HEAD (the original ETag method) or GET (the new download method added in this PR) was not a great prospect.

The original error is still printed along with a custom message. Do you think we could do better?
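As a hedged sketch of the wrapping being discussed (the function name, URL handling, and message are illustrative, not the PR's exact code):

```typescript
import { Octokit } from "@octokit/rest";
import { RequestError } from "@octokit/request-error";

// Illustrative only: keep the original Octokit error text, but say which
// request failed, so "Not Found" in the logs points at the GET download.
async function downloadAsset(
  octokit: Octokit,
  url: string
): Promise<ArrayBuffer> {
  try {
    const response = await octokit.request(`GET ${url}`, {
      headers: { accept: "application/octet-stream" },
    });
    return response.data as ArrayBuffer;
  } catch (e) {
    if (e instanceof RequestError) {
      throw new Error(
        `Unable to download asset from ${url} for checksum verification: ` +
          `${e.status} ${e.message}`
      );
    }
    throw e;
  }
}
```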
// XXX: This is a bit hacky as we rely on various things:
// 1. GitHub issuing a redirect to AWS S3.
// 2. S3 using the MD5 hash of the file for its ETag cache header.
// 3. The file being small enough to fit in memory.
Calculating an MD5 hash from streaming data is quite trivial in Node, so I'd encourage you to revisit this approach in a follow-up PR, as it causes unnecessary memory usage. We may have rather large files.
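For reference, a minimal sketch of the streaming approach suggested here, using only Node's standard library (the file path is a placeholder; an HTTP response stream would work the same way):

```typescript
import { createHash } from "crypto";
import { createReadStream } from "fs";

// Hash chunks as they arrive instead of buffering the whole file.
function md5OfStream(stream: NodeJS.ReadableStream): Promise<string> {
  return new Promise((resolve, reject) => {
    const hash = createHash("md5");
    stream.on("error", reject);
    stream.on("data", (chunk) => hash.update(chunk));
    stream.on("end", () => resolve(hash.digest("hex")));
  });
}

// Usage: constant memory regardless of file size.
md5OfStream(createReadStream("./some-asset.tgz")).then(console.log);
```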
Yes, @iker-barriocanal and I talked extensively about this.

Trade-offs:

- Octokit's `request` method seems to buffer the full response payload, so if we wanted to work with streams we'd need to depart from using `request`. I believe we're originally using Octokit in this code path to benefit from the different rate limits given to authenticated requests -- @BYK can you confirm that?
- We don't work with streams when doing the upload. We read (with `readSync`!) full asset files from disk into memory before uploading. We use the same bytes in memory to calculate the "local checksum". So the assumption that files are small enough was implicitly already there (see the sketch after this list).
- If memory usage becomes a concern, then certainly we need to revisit this, but then probably not only the fallback download step.
- We've filed issues here on GitHub and internal tracking tickets for future improvements. Perhaps none specifically about streaming -- now done in GitHub target: reduce memory consumption by streaming uploads/downloads #329
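To make the second bullet concrete, a sketch of the buffered pattern already in place (the file name is hypothetical):

```typescript
import { createHash } from "crypto";
import { readFileSync } from "fs";

// The whole asset is read into memory up front; the same bytes are used
// both for the upload and for the local checksum.
const contents = readFileSync("./some-asset.tgz");
const localChecksum = createHash("md5").update(contents).digest("hex");
// `contents` would then be handed to the upload call as a buffer.
console.log(localChecksum);
```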
> Octokit's `request` method seems to buffer the full response payload, so if we wanted to work with streams we'd need to depart from using `request`. I believe we're originally using Octokit in this code path to benefit from the different rate limits given to authenticated requests
Octokit takes care of a bunch of things, from authentication to rate limiting to parallelization to automatic retries. So I'd say stick with it as long as you can. If solving this issue requires a departure from Octokit, delay that as long as possible and file an issue with them ahead of time so they might be able to bring this into the library itself.
> We don't work with streams when doing the upload. We read (with `readSync`!) full asset files from disk into memory before uploading. We use the same bytes in memory to calculate the "local checksum". So the assumption that files are small enough was implicitly already there.
Great catch. This is again due to Octokit limitations, though, as it does not support streams or piping, only buffers.
References to our future selves (?):
- Downloading a tarball of a migration archive octokit/octokit.js#916 (comment)
- How to use streaming in createfile and updatefile octokit/octokit.js#887 (comment)
2018 was a fine year, right? :)