Unexpected signature in zip file error #50
If you haven't changed your artifact retention settings, the zip file causing the issue will be retained for 90 days by default. As long as your build is in a public repo, you could just link to the specific build page, and the artifact .zip file would be available to download from there.
Unfortunately, even though it's an open-source project, we have our retention time set to 1 day. We use these artifacts solely to move build results from the build jobs to the test jobs, so there's no need to keep them.
https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/attaching-files says the maximum file size for issue attachments is 10 MB for images and videos (100 MB for videos if the uploader has a paid plan) and 25 MB for all other file types, so it's a moot point. If you want to share the failing .zip file, you'll need to upload it to a service like Google Drive or Mega Share.
I've put the zip archive here. The archive can be opened and the contents extracted by every other zip utility I've tried. It was produced by the upload-artifact GitHub Action. I've been using the following script to run unzip-stream:

```js
#!/usr/bin/env node
'use strict'
const fs = require('fs');
const unzip = require('unzip-stream');
fs.createReadStream(process.argv[2])
    .pipe(unzip.Extract({
        path: process.argv[3],
        debug: true
    }));
```

The results are like this:

```
$ scripts/unzip.js ~/Downloads/state-8.10.7-ubuntu-latest.zip output
decoded LOCAL_FILE_HEADER: {
"versionsNeededToExtract": 20,
"flags": "0x8",
"compressionMethod": 8,
"lastModifiedTime": 29644,
"lastModifiedDate": 22738,
"crc32": 0,
"compressedSize": 0,
"uncompressedSize": 0,
"fileNameLength": 9,
"extraFieldLength": 0,
"extra": {},
"path": "state.tar",
"extraFields": []
}
Skipped 1 bytes
Unexpected signature in zip file: 0x16d4b50 "PKm", skipped 1 bytes
node:events:496
throw er; // Unhandled 'error' event
^
Error: Invalid signature in zip file
at UnzipStream.processDataChunk (~/src/actions/toolkit/node_modules/unzip-stream/lib/unzip-stream.js:155:40)
at UnzipStream._parseOrOutput (~/src/actions/toolkit/node_modules/unzip-stream/lib/unzip-stream.js:651:28)
at Array.done (~/src/actions/toolkit/node_modules/unzip-stream/lib/unzip-stream.js:713:18)
at callFinishedCallbacks (node:internal/streams/writable:973:25)
at finish (node:internal/streams/writable:944:3)
at process.processTicksAndRejections (node:internal/process/task_queues:82:21)
Emitted 'error' event on Extract instance at:
at emitErrorNT (node:internal/streams/destroy:169:8)
at emitErrorCloseNT (node:internal/streams/destroy:128:3)
at process.processTicksAndRejections (node:internal/process/task_queues:82:21)
Node.js v20.12.2
```
The offending 0x50 0x4b 0x6d 0x01 sequence appears right at the end of the file, about where the central directory file header (0x50 0x4b 0x01 0x02) would be expected.

I was able to look at @neilmayhew's project yesterday, and I downloaded a couple of zip files produced by a different run of the GitHub Actions workflow before the artifacts expired. So I have another bad .zip file with the exact same 0x50 0x4b 0x6d 0x01 signature, plus a good .zip file produced by the same build run. Here are some hexdumps from the ends of those three zip files:

- The .zip file that @neilmayhew linked to in #50 (comment):
- Another .zip file that failed, from a different run of the same build process:
- A valid .zip file from the same build process, but with a different combination of matrix flags, so it didn't trigger whatever is producing these 0x50 0x4b 0x6d 0x01 signatures:
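In case anyone wants to regenerate dumps like these, here's a minimal Node sketch that prints the last bytes of a file in hex (the byte count is arbitrary; plain `hexdump -C` on the tail of the file shows the same thing):

```js
// Sketch: print the last 64 bytes of a file in hex.
const fs = require('fs');

const path = process.argv[2];             // e.g. state-8.10.7-ubuntu-latest.zip
const size = fs.statSync(path).size;
const len = Math.min(64, size);
const buf = Buffer.alloc(len);

const fd = fs.openSync(path, 'r');
fs.readSync(fd, buf, 0, len, size - len); // read only the tail of the file
fs.closeSync(fd);

console.log(buf.toString('hex').match(/.{1,32}/g).join('\n'));
```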
My assumption, after looking at the code for a bit, is that it's legitimate for archive creators to insert padding bytes between segments, and that it's difficult for a streaming consumer of the archive to navigate these correctly if they happen to contain garbage. A non-streaming consumer can use the directory entries to seek to the right offset in the file, and thus never sees the garbage. As the README says:

It seems that my project has been hitting those cases in which it doesn't succeed, perhaps because of the size of the files being archived. This would explain why it fails for some of the matrix jobs and not others: the older compiler produces slightly larger build results. We found that gzipping the tar archive that's placed in the zip archive bypassed the problem, perhaps because the files are smaller. However, this double compression adds about 30 minutes to our CI time, so the downgrade approach is preferable.
The fact that the "bad signature" appears near the end of the file is consistent with the behaviour I've observed, which is that it takes a while for the error to appear.
I think I've figured out what's going on. The file size of the uncompressed state.tar file is 6128619520 bytes. If you convert that to hex, 6128619520 is 0x16d4b5000. Does that hex value look familiar? Skip 1 byte in little-endian order and you get 0x16d4b50, the very value that unzip-stream was interpreting as a signature, since it contains the bytes 0x50 0x4b.

In fact, the hexdumps I posted in my previous comment were just one line too short. I should have started one line earlier, where the 0x50 0x4b 0x07 0x08 data descriptor signature appears.
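To make the coincidence concrete, here's a throwaway Node snippet (my own, not unzip-stream code; the numbers are the ones above):

```js
// Write the uncompressed size of state.tar as an 8-byte little-endian value.
const buf = Buffer.alloc(8);
buf.writeBigUInt64LE(6128619520n);
console.log(buf); // <Buffer 00 50 4b 6d 01 00 00 00>

// Misread it 1 byte in, as a 4-byte value, and out pops the bogus "signature".
console.log(buf.readUInt32LE(1).toString(16)); // 16d4b50 -- the "PKm" from the error
```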
Here we see a 0x50 0x4b 0x07 0x08 signature, which marks the start of a data descriptor, laid out as follows (in this file's zip64 form, the sizes are 8-byte values):

```
Signature:         50 4b 07 08
CRC-32:            4 bytes
Compressed size:   8 bytes
Uncompressed size: 8 bytes
```

So what's happening is that unzip-stream is reading the data descriptor header, but incorrectly assuming that it will have 4-byte sizes rather than 8-byte sizes. As a result, unzip-stream starts looking for the next data descriptor header right at the start of the uncompressed-size bytes. And they just so happen to have 0x50 0x4b in them, so unzip-stream says "Oh, a signature marker" and processes it as such, then throws an error when the signature marker turns out to be invalid. If instead unzip-stream had skipped the invalid signature marker and kept on looking, it would have found a valid 0x50 0x4b 0x01 0x02 marker just seven bytes later.

But wait: why did unzip-stream think the data descriptor would have 4-byte file sizes? Because here's the header of the file and what it means:
```
Signature:          50 4b 03 04  (local file header)
Version needed:     20           (2.0; nothing indicating zip64)
Flags:              0x08         (CRC-32 and sizes deferred to a data descriptor)
Compression method: 8            (deflate)
CRC-32:             0            (deferred)
Compressed size:    0            (deferred)
Uncompressed size:  0            (deferred)
File name length:   9            ("state.tar")
Extra field length: 0            (no zip64 extra field)
```

Now, the .zip format documentation says this in section 4.4.9 (paraphrasing): if an archive is in zip64 format and one of these 4-byte size fields is 0xFFFFFFFF, the real size is stored in the corresponding 8-byte field of the zip64 extended information extra field.
Here there was no extra field, and unzip-stream's assumption is that sizes will be 8 bytes if and only if there's a zip64 extra field. E.g., this code in lib/unzip-stream.js that looks for the data descriptor:

```js
var fileSizeKnown = !(vars.flags & 0x08);
if (fileSizeKnown) {
    entry.size = vars.uncompressedSize;
}
// ...
if (!fileSizeKnown) {
    var pattern = new Buffer(4);
    pattern.writeUInt32LE(SIG_DATA_DESCRIPTOR, 0);
    var zip64Mode = vars.extra.zip64Mode;
    var extraSize = zip64Mode ? 20 : 12;
    var searchPattern = {
        pattern: pattern,
        requiredExtraSize: extraSize
    }
    // ...
}
```

If you search for where `zip64Mode` is set, you'll find that it only gets set when a zip64 extra field has been parsed. However, this zip file proves that some programs that create .zip files do not set zip64 mode. Is this a bug in the creating program? Well, yes: section 4.3.9 of the zip file standard says (paraphrasing) that when a file's size exceeds 0xFFFFFFFF, the compressed and uncompressed sizes in the data descriptor should be stored as 8-byte values, and that the presence of the zip64 extended information extra field is what tells an extractor to expect them.
So 8-byte sizes in the absence of a zip64 marker are technically in violation of the spec. However, because this .zip file was produced by GitHub Actions' upload-artifact action (which streams the .zip to its output), it was not actually possible for the compressed file size to be known ahead of time. And since the spec requires the compressed file size to be written into the zip64 extra field, the upload-artifact action chose not to write one, because its compressed file size would be invalid anyway.

Which means that unzip-stream probably needs to try both the 4-byte version (total 4+4+4 = 12 bytes) and the 8-byte version (total 4+8+8 = 20 bytes) of the data descriptor, and check whether it finds a valid signature immediately following each. If a valid signature is found after either one, that will tell you whether the sizes are 4 bytes or 8 bytes. (It is not possible for a valid signature to be found after both: if the data descriptor was supposed to be 12 bytes and you read 20 bytes, you'll be 8 bytes into the next block, which will put you in either the "flags" or "compression method" field. Either way, 0x504b would be a totally invalid value for flags or compression method, and will not appear.)

Update: Actually, a better approach would be to read the compressed file size as either a 4-byte value or an 8-byte value, and see if either one corresponds to the actual compressed size encountered thus far (which will be known at that point). If both are valid (and thus the same value), then the correct answer is to use 8-byte values. Why? Well, if the compressed file size is in fact less than 4 GiB (let's say it's 0x12345678) but is written as an 8-byte value, then it will look like 78 56 34 12 00 00 00 00, and reading it as either a 4-byte value or an 8-byte value will be correct. BUT if the compressed file size was actually written as a 4-byte value, then it will be followed by an uncompressed size, e.g. 78 56 34 12 f3 c8 a0 9a, and reading that as an 8-byte value will be totally wrong. Therefore, if both a 4-byte read and an 8-byte read of the compressed file size produce the same value, the 8-byte read is the correct one. (Besides which, if it really were the 4-byte version, it would mean that the uncompressed file size was 0 bytes; a file of 0 bytes would have been stored with 0 bytes of compressed size, and there wouldn't be any 4-byte vs. 8-byte confusion.)
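A quick illustration of that last point, using the example bytes from above (a throwaway snippet of mine, not unzip-stream code):

```js
// A 4-byte compressed size 0x12345678 followed by the start of an uncompressed size:
const fourByte = Buffer.from([0x78, 0x56, 0x34, 0x12, 0xf3, 0xc8, 0xa0, 0x9a]);
console.log(fourByte.readUInt32LE(0).toString(16));    // 12345678 -- plausible
console.log(fourByte.readBigUInt64LE(0).toString(16)); // 9aa0c8f312345678 -- nonsense

// The same size written as an 8-byte value reads correctly either way:
const eightByte = Buffer.from([0x78, 0x56, 0x34, 0x12, 0x00, 0x00, 0x00, 0x00]);
console.log(eightByte.readUInt32LE(0).toString(16));    // 12345678
console.log(eightByte.readBigUInt64LE(0).toString(16)); // 12345678
```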
@rmunn Amazing! You've done some excellent detective work there. So is your conclusion that the archive is invalid, because it doesn't indicate Zip64 format correctly? Or that unzip-stream's logic is incorrect?
@neilmayhew - I think I answered that with the edit I was making while you commented. :-) But basically, it's both.

Therefore, my conclusion (at the bottom of my now-edited previous comment) is that unzip-stream should, on encountering a data descriptor header, try reading it in both 12-byte and 20-byte sizes, and try parsing the file sizes as either 4 bytes or 8 bytes. The compressed file size is known at that point, so that's an easy one to compare with: it occupies bytes 5 through 8 or bytes 5 through 12 (counting from after the signature), so either way the bytes you're interpreting are entirely from the data descriptor. If the 8-byte size parsing produces incorrect values, then you know the 12-byte version (with 4-byte file sizes) was the correct one, and you have read 8 bytes too far into the stream. Push those 8 bytes back into your parser, because they almost certainly start with another marker (probably a 0x50 0x4b 0x01 0x02 marker), and parse them.

Alternately, just read the 12-byte version and try to parse it. If the compressed file size matches the known file size you received and is non-zero, but the uncompressed file size (the next 4 bytes) is zero, then you know that you are actually dealing with an 8-byte file size, and you need to read 8 more bytes to get the uncompressed file size. (You can know this because the spec says that a zero-byte file must be stored with no file data in the .zip file, and therefore the compressed file size should also have been zero. So if you see a non-zero 4-byte compressed file size followed by four zero bytes, you can know for certain that you're really looking at an 8-byte file size that is less than 4 GiB.)
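That second heuristic is easy to sketch (my own code, not a patch; `bytesSeen` is a hypothetical name for the compressed byte count the stream has already counted):

```js
// Sketch: after reading a 12-byte data descriptor (CRC + two 4-byte sizes),
// detect the case where the sizes were really 8 bytes. `buf` starts just past
// the 50 4b 07 08 signature.
function sizesWereReally8Bytes(buf, bytesSeen) {
    const compressed = buf.readUInt32LE(4); // bytes 5-8: the 4-byte compressed size?
    const next = buf.readUInt32LE(8);       // bytes 9-12: uncompressed size, or the
                                            // high half of an 8-byte compressed size
    // A non-zero compressed size that matches what we counted, followed by four
    // zero bytes, must (per the zero-byte-file argument above) be an 8-byte size
    // under 4 GiB. Compressed sizes of 4 GiB+ would need the 8-byte read instead.
    return compressed !== 0 && compressed === bytesSeen && next === 0;
}
```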
So this bug could only trigger under the following conditions:

- The .zip file is produced by a streaming writer, so the sizes go into a trailing data descriptor (general-purpose flag bit 3) instead of the local file header.
- The writer doesn't put a zip64 extra field in the local file header, so a streaming reader has no warning that the sizes will be 8 bytes.
- At least one file in the archive is 4 GiB or larger, so the data descriptor actually uses 8-byte sizes.
- The bytes of that 8-byte size happen to contain the sequence 0x50 0x4b at just the right offset to look like the start of a signature.

The chances of all that happening are extremely small, but here we are. Congratulations, @neilmayhew - you have won the bug lottery. 😀
Correction: the zip file is valid. There is, in fact, a zip64 extra field, but it's at the end of the file (in the central directory record), not immediately after the local file header. Non-streaming unzip software would therefore look at the central directory record, see that there's a zip64 extra field, and know that the file sizes involved are 8 bytes. So that software, if it even looked at the data descriptor at all (which wouldn't really be necessary, given that its info is all duplicated in the central directory record at the end of the file), would know that it should be 20 bytes in size, not 12 bytes.

But unzip-stream can't see the end-of-file directory while streaming, so it actually has no way at all to tell whether the data descriptor is going to be 12 bytes or 20 bytes, because it can't count on having seen a zip64 extra field before reaching the end of the zip file. The only reliable way to tell is to read the compressed file size as an 8-byte value, then do one of two things:

- If the 8-byte value matches the compressed size actually seen in the stream, treat the sizes as 8-byte values and read 8 more bytes for the uncompressed size.
- If only the low 4 bytes match the compressed size seen, the sizes are 4-byte values: the high 4 bytes you just read are really the uncompressed size.
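Here's roughly what that would look like (a sketch, not a patch against unzip-stream; `bytesSeen` is again a hypothetical name for the observed compressed byte count):

```js
// Sketch: disambiguate a 12-byte vs. 20-byte data descriptor while streaming.
// `buf` starts at the CRC-32, just past the 50 4b 07 08 signature.
function parseDataDescriptor(buf, bytesSeen) {
    const crc32 = buf.readUInt32LE(0);
    const as64 = buf.readBigUInt64LE(4); // compressed size, read as 8 bytes

    if (bytesSeen > 0 && as64 === BigInt(bytesSeen)) {
        // 8-byte sizes: the next 8 bytes hold the uncompressed size.
        return { crc32, compressedSize: as64,
                 uncompressedSize: buf.readBigUInt64LE(12), descriptorBytes: 20 };
    }
    // Otherwise the sizes must be 4-byte values.
    // (Zero-byte entries stay ambiguous and would need extra care in a real patch.)
    return { crc32, compressedSize: BigInt(buf.readUInt32LE(4)),
             uncompressedSize: BigInt(buf.readUInt32LE(8)), descriptorBytes: 12 };
}
```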
Thanks, @rmunn, for this very comprehensive analysis. Stepping back a bit, I see two things. First, the zip archive format is messy, because extending it to support 64 bits was tricky. Second, it really isn't friendly to being read in a streaming fashion. There have been calls to switch the artifact format to tar, although I'm not sure what the logistical implications of supporting that would be. There's a tar package on NPM that seems to be widely used and actively maintained. However, forking the artifact actions repos really means forking the toolkit repo, which contains a lot of other functionality that would have to be maintained in parallel with the upstream one.
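For what it's worth, streaming extraction with that package is nearly a one-liner (a sketch using node-tar; the argument handling mirrors the unzip script above):

```js
// Sketch: streaming tar extraction with the node-tar package ("tar" on NPM).
const fs = require('fs');
const tar = require('tar');

fs.createReadStream(process.argv[2])      // the .tar file to extract
    .pipe(tar.x({ C: process.argv[3] })); // C: the directory to extract into
```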
Another alternative has just occurred to me. If the upload action could arrange for the Zip64 marker to be put at the beginning of the file instead of at the end, this would make it possible to read the file in a streaming fashion, and avoid having to make changes to unzip-stream.
The upload library streams its contents to the destination as it zips them, which is why it can't know the compressed size ahead of time. And the zip spec says that the marker at the beginning of the file, if present, is supposed to include the compressed size. The only way to implement that suggestion would be to change the upload library to buffer each file as it's compressed and then write the compressed version to the stream, which could consume a lot of memory, especially since people are working around the file-permissions bug by tarring the files beforehand and uploading a single tar file. So in cases like yours, basically the entire zip output would end up cached in RAM before being streamed out all at once, thereby entirely defeating the point of streaming it. Streaming in tar format, though, would be a much easier change to make, as the library that upload-artifact is using already supports tar format.
Downgrading
@rmunn That was an excellent analysis! It seems like unzip-stream could try to handle these cases. Care to write a patch for this scenario?
The full message (with debug logging) is:

This zip file is one that was created by the upload-artifact GitHub Action. The download-artifact action is unable to download the artifact because it's using unzip-stream, which is unable to extract it. I can provide the zip, but it's fairly large (938M), so I'm not sure if I should attach it here.