Add support for concatenated gzip files. #6442
Conversation
Thank you for contributing this pull request! Here are a few pointers to make sure your submission will be considered for inclusion.

Commit eendeego/node@d69b312 has the following error(s):

Commit eendeego/node@7934cc7 has the following error(s):

The following committers were not found in the CLA:

You can fix all these things without opening another issue. Please see CONTRIBUTING.md for more information.

Already signed the CLA as "Luis Miguel Fonseca dos Reis".
Thanks for contributing this! Perhaps it'd be better if zlib were able to do "partial" decompression and return the offset it stopped at? /cc @isaacs

That can be done, but there must be a way to know the Z_STREAM_END status and issue the necessary inflateReset (which I think is already visible in JS space).
What's the behaviour when there is garbage at the end? gzip, for example, decompresses it properly but prints out a warning to stdout:
Hi @indutny, I've been looking at some way to emit data events for each [zlib] stream termination. I have these issues:

Z_STREAM_END handling in node_zlib.cc / After function:

Here's the stack trace for the failed test:

Note that the test fails with exactly the same error even if

What am I doing/assuming wrong?
Wow, thanks for trying it out! I think that it should be OK if it just returned the offset it stopped at; after that, zlib.js can invoke inflate again with

So, please don't call it in a loop and let JS handle it. This should be much simpler than writing all this logic in C++. Please let me know if you need any other hints and/or help!
This last commit moves stream end handling to JS land. I'm not sure the test on the

To be able to handle the garbage example given by @rlidwka, I had to create a new send_error var on the worker After callback due to race conditions. Waiting for feedback...
Sorry, I know it took a while for me to get here. But I'm still not ready to give you any feedback. I don't really enjoy that it does it automatically, but I'm not sure how it should look API-wise. I'll figure it out a bit later.
self._buffer,
self._offset,
self._chunkSize);
newReq.callback = callback; // this same function
will lint complain about not having two spaces between the code and the comment, or is that just for cpp?
It's just the same as every other comment in this file — actually, that line is copied from some lines above.
cool. wasn't worried, but also couldn't remember which lint would check for that.
This'll need to be rebased.
@trevnorris Apart from the

Sorry.
@trevnorris I don't really enjoy how it automatically continues decompressing without letting the user know that it does it. I think it should be explicit. In other words, ideally, I think that there should be a

The main problem is that this mode is not really supported by streams. @isaacs, if you have time — may I ask your advice on this topic?
@indutny Implementations in other languages also do this transparently. How about some (ignorable) event sent to the client code?
@luismreis: I had a conversation with @tjfontaine; I think we are going to land it. @tjfontaine was going to review it one more time. Assigning the issue to him.

@tjfontaine reminder
@@ -510,6 +510,25 @@ Zlib.prototype._transform = function(chunk, encoding, cb) {
    return;
  }

  if (availInAfter > 0) {
is there a sane way to refactor this such that these two clauses can be combined instead of duplicated?
I'd like to land this, but am hoping you're able to refactor to reduce some of the duplication.

@luismreis heya, any news?
@indutny if it doesn't continue to decompress automatically, then how should it work with the synchronous zlib functions? Or have you agreed with the automatic approach?

Yes, I agreed with it after all.

Alright, I'll see if I can rebase and refactor this and make a new PR.

Hi @indutny, I haven't had much time recently. I'd like to pick it up, but I'm short on time.

It seems this PR makes it an error to have trailing garbage. According to gzip.org, it's not unusual for tar.gz files to be padded with zeroes. Also, I don't see a way to retrieve the decompressed data when using the sync functions if it throws on garbage. On the other hand, Python's implementation raises an exception when there's trailing garbage in gzip (but not zlib), and of course doesn't return anything. But then again, Python's API is terrible anyway, in my opinion. How do you think we should approach this? How do other implementations do it?

Yes, it throws an error. I know that I was reproducing behavior from something else, but I can't recall what now.
// OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE
// USE OR OTHER DEALINGS IN THE SOFTWARE.
// test unzipping a file that was created by contactenating multiple gzip |
Typo: s/contactenating/concatenating/
Some nits but otherwise LGTM.
Hey @bnoordhuis, @trevnorris, thank you very much for the comments. I've made all the suggested changes except for adding the documentation. Why? Because per the RFC this isn't an extra feature; it's something that should have always been there. However, it's amazing to google for this kind of issue and find that every language's zlib implementation had the same problem at some point. So... if you still think this documentation is necessary, I suggest adding the following paragraph to
@trevnorris, @tjfontaine, @indutny ... ping...
I agree with @luismreis. This is not really a new feature; it's a bug fix. Concatenated gzip data sets are a valid gzip file and stream. I got caught by this bug when trying to uncompress the web data from http://commoncrawl.org, which uses this aspect of the gzip file format to enable indexing into a gzip file. The Python code that Common Crawl supplies supports this form of gzip streaming; I was a little surprised to find that node.js didn't. But a big thanks to @luismreis for this PR!
@indutny What's your take? I'm alright with the change.

Sorry for the spam, but... @trevnorris, @tjfontaine, @indutny ... ping...

Yes please! It's now been over a year since this original PR was submitted.

It LGTM'd last time I looked at it, but I can't land it here. I can land it in node-forward, but that repo is private for one or two more days.

LGTM too.

Overall LGTM. Will this land cleanly on v0.12?

It should; it was recently rebased.

Landed in node-forward/node@1183ba4, thank you!

NO! Thank you!

Thank you! Will this bug fix be applied to v0.10 as well?
Reviewed-By: Fedor Indutny <[email protected]>
PR-URL: #6442
Landed in 6f6a979 for v0.12. Thanks!

Thanks @chrisdickinson. Will this bug fix be rebased onto the 0.10 branch too?
@luismreis @chrisdickinson Should we re-open this PR so that we can fix the regression it caused (see #8962) and land it again in a later milestone? In the meantime, I'm removing it from the 0.11.15 milestone.

@misterdjules I don't really know what to say... According to the (gzip) spec we should be restarting the gunzip process each time there's more data, but, apparently, npm depends on the behavior of terminating at the first terminating gzip stream.
No description provided.