fix: Remove GitHub asset checksum #333

Merged · 1 commit · Dec 7, 2021
74 changes: 3 additions & 71 deletions src/targets/github.ts
@@ -1,8 +1,7 @@
 import { Octokit, RestEndpointMethodTypes } from '@octokit/rest';
 import { RequestError } from '@octokit/request-error';
-import { readFileSync, promises, statSync } from 'fs';
+import { createReadStream, promises, statSync } from 'fs';
 import { basename } from 'path';
-import { BinaryLike, createHash } from 'crypto';

import { getConfiguration } from '../config';
import {
@@ -361,97 +360,30 @@ export class GithubTarget extends BaseTarget {
const uploadSpinner = ora(
`Uploading asset "${name}" to ${this.githubConfig.owner}/${this.githubConfig.repo}:${release.tag_name}`
).start();

try {
-      const file = readFileSync(path);
+      const file = createReadStream(path);
Member (@BYK) commented:

Since most of the files we upload are small files, using readFileSync is actually safer and more efficient, especially compared to a read stream. This is a lesson we learned from npm folks when working on Yarn: yarnpkg/yarn#3539

Contributor Author replied:

Thanks for the reference, @BYK.

The change from createReadStream to readFileSync was not documented in #290, seemed like the motivation was just to get types to match (string) 🤔 which led to the broken asset uploads.

Later in #328 (comment) we were discussing streaming... (at that time I had not realized that we changed from stream to read-all also in #290, relatively recently), and the concern was memory usage.

I haven't benchmarked, but I believe for our use case with Craft the performance difference probably doesn't matter?

> readFileSync is actually safer and more efficient

I can intuitively understand the "more efficient", but, I'm curious, why do you say it is also "safer"?

Contributor Author:

> The reason this is OK is because pretty much any files this would
> handle would fit neatly into memory (any npm packages MUST fit
> into memory by definition, because of the way npm@<5 does extracts).
>
> If you really want to make doubleplus sure to minimize memory usage,
> you could do an fs.stat to find the file size and then do heuristics
> to only use streams for files bigger than MB.

Quoted from yarnpkg/yarn#3539

Craft asset sizes, theoretically, don't really need to fit in memory (though we're probably not uploading any large assets in Sentry, and I don't even know what GitHub's limit is tbh).

The second paragraph applies -- we already do a stat on the asset file, we could choose between readFileSync and createReadStream, but not sure the added complexity is worth it. More code, more surface area for things to go wrong.
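For reference, the stat-based heuristic the yarn comment describes is small either way; here is a minimal sketch, where the 10 MiB cutoff and the helper names are made up for illustration:

```typescript
import { readFileSync, createReadStream, statSync } from 'fs';
import type { ReadStream } from 'fs';

// Hypothetical cutoff: buffer small files, stream anything larger.
const STREAM_THRESHOLD = 10 * 1024 * 1024; // 10 MiB, an arbitrary choice

// Craft already stats the asset file, so the decision can reuse that size.
function shouldStream(sizeBytes: number): boolean {
  return sizeBytes > STREAM_THRESHOLD;
}

function assetSource(path: string): Buffer | ReadStream {
  return shouldStream(statSync(path).size)
    ? createReadStream(path) // large: avoid holding the whole file in memory
    : readFileSync(path); // small: one read, no stream event handling
}
```

Whether the two extra branches are worth it is exactly the trade-off discussed above: more code, more surface area for things to go wrong.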

Member (@BYK):

> The change from createReadStream to readFileSync was not documented in #290, seemed like the motivation was just to get types to match (string) 🤔 which led to the broken asset uploads.

Yup, my bad for not making the reason for change explicit, sorry 😞

Re #328: I think we all decided that the asset sizes not fitting into memory was a distant, potential issue, hence me being content with readFileSync.

> I haven't benchmarked, but I believe for our use case with Craft the performance difference probably doesn't matter?

You may be surprised 🙂

> I can intuitively understand the "more efficient", but, I'm curious, why do you say it is also "safer"?

Because dealing with streams in Node is close to insanity. In a world dominated by promises nowadays, streams still rely on events and it is very easy to not handle a case where the stream errored out etc. Now you'd expect the Octokit library would handle that but we've already seen that this API was not stable enough with the bad typing information (yes this sounds like FUD but essentially, based on my past experience with streams, I'd try to avoid them at all costs).

Anyway, I just wanted to provide the context around it. This code worked with createReadStream for years so I don't think there's a strong case for going either way.
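The failure mode described above is concrete: a stream consumer that never attaches an 'error' listener turns a read failure into an uncaught exception. A minimal sketch of the promise wrapper that forces the error path to be handled (the helper name is illustrative, not from the Craft codebase):

```typescript
import { createReadStream } from 'fs';

// Collect a file stream into a Buffer, surfacing stream errors as a rejection.
// Forgetting the 'error' listener below is exactly the trap described above:
// the error would be emitted as an uncaught exception instead.
function readAll(path: string): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const chunks: Buffer[] = [];
    const stream = createReadStream(path);
    stream.on('data', (chunk) => chunks.push(chunk as Buffer));
    stream.on('end', () => resolve(Buffer.concat(chunks)));
    stream.on('error', reject); // without this line, errors escape the promise
  });
}
```

With readFileSync none of this event wiring exists, which is the "safer" part of the argument.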

Contributor Author:

Thanks @BYK, I really appreciate you taking the time to provide us so much insight 🙇 ❤️

Contributor Author:

I cut #334 to whomever comes next maintaining Craft, no need to rush to make changes now.

Contributor:

TIL streams <> node.

const { url, size } = await this.handleGitHubUpload({
...params,
// XXX: Octokit types this out as string, but in fact it also
// accepts a `Buffer` here. In fact passing a string is not what we
// want as we upload binary data.
data: file as any,
});

uploadSpinner.text = `Verifying asset "${name}...`;
if (size != stats.size) {
throw new Error(
`Uploaded asset size does not match local asset size for "${name} (${stats.size} != ${size}).`
);
}

-      const remoteChecksum = await this.checksumFromUrl(url);
-      const localChecksum = this.checksumFromData(file);
-      if (localChecksum !== remoteChecksum) {
-        throw new Error(
-          `Uploaded asset checksum does not match local asset checksum for "${name} (${localChecksum} != ${remoteChecksum})`
-        );
-      }
uploadSpinner.succeed(`Uploaded asset "${name}".`);
return url;
} catch (e) {
uploadSpinner.fail(`Cannot upload asset "${name}".`);

throw e;
}
}

-  private async checksumFromUrl(url: string): Promise<string> {
-    // XXX: This is a bit hacky as we rely on various things:
-    // 1. GitHub issuing a redirect to AWS S3.
-    // 2. S3 using the MD5 hash of the file for its ETag cache header.
-    // 3. The file being small enough to fit in memory.
-    //
-    // Note that if assets are large (5GB) assumption 2 is not correct. See
-    // https://github.com/getsentry/craft/issues/322#issuecomment-964303174
-    let response;
-    try {
-      response = await this.github.request(`HEAD ${url}`, {
-        headers: {
-          // WARNING: You **MUST** pass this accept header otherwise you'll
-          // get a useless JSON API response back, instead of getting
-          // redirected to the raw file itself.
-          // And don't even think about using `browser_download_url`
-          // field as it is close to impossible to authenticate for
-          // that URL with a token and you'll lose hours getting 404s
-          // for private repos. Consider yourself warned. --xoxo BYK
-          Accept: DEFAULT_CONTENT_TYPE,
-        },
-      });
-    } catch (e) {
-      throw new Error(
-        `Cannot get asset on GitHub. Status: ${(e as any).status}\n` + e
-      );
-    }
-
-    const etag = response.headers['etag'];
-    if (etag && etag.length > 0) {
-      // ETag header comes in quotes for some reason so strip those
-      return etag.slice(1, -1);
-    }
-
-    return await this.md5FromUrl(url);
-  }
-
-  private async md5FromUrl(url: string): Promise<string> {
-    this.logger.debug('Downloading asset from GitHub to check MD5 hash: ', url);
-    let response;
-    try {
-      response = await this.github.request(`GET ${url}`, {
-        headers: {
-          Accept: DEFAULT_CONTENT_TYPE,
-        },
-      });
-    } catch (e) {
-      throw new Error(
-        `Cannot download asset from GitHub. Status: ${(e as any).status}\n` + e
-      );
-    }
-    return this.checksumFromData(Buffer.from(response.data));
-  }
-
-  private checksumFromData(data: BinaryLike): string {
-    return createHash('md5').update(data).digest('hex');
-  }

private async handleGitHubUpload(
params: RestEndpointMethodTypes['repos']['uploadReleaseAsset']['parameters'],
retries = 3
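The verification this PR deletes hinged on one equivalence: for single-part uploads, S3's ETag is the hex MD5 of the object body, wrapped in double quotes. A standalone sketch of that comparison (the function name is made up; multipart uploads break the assumption, as the deleted comment notes):

```typescript
import { createHash } from 'crypto';

// Compare an S3-style ETag header against the MD5 of a local buffer.
// Only valid for single-part uploads; multipart ETags are not plain MD5s.
function etagMatchesMd5(etag: string, body: Buffer): boolean {
  const localMd5 = createHash('md5').update(body).digest('hex');
  return etag.replace(/^"|"$/g, '') === localMd5; // ETags arrive quoted
}
```

This is the same quote-stripping and md5 hashing the deleted checksumFromUrl and checksumFromData did, collapsed into one helper for illustration.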