option to emit raw string buffers instead of decoded strings #42

dweinstein · 2016-10-03T19:15:03Z

I'm using version 2.6.0 FWIW, node 6.

± node debug bin.js foo.zip
< Debugger listening on [::]:5858
connecting to 127.0.0.1:5858 ... ok
break in bin.js:2
  1
> 2 'use strict'
  3 const extractExec = require('./')
  4 const fs = require('fs')
c
break in index.js:46
 44     // TODO: what if we get multiple plists?
 45     const plist = plists[0]
>46     debugger
 47     getExecStream(fd, plist.CFBundleExecutable, (err, entry, exec) => {
 48       debugger
c
break in index.js:19
 17     zip.on('entry', function onentry (entry) {
 18       if ((/XXXThing.*app\/XXXThing-.*/i).test(entry.fileName)) {
>19         debugger;
 20       }
 21       if (!isOurExec(entry, execname)) { return }
repl
Press Ctrl + C to leave debug repl
> entry.fileName
'Payload/XXXThing-╬▓.app/XXXThing-╬▓'
> execname
'XXXThing-β'

As you can see, execname is correct, but entry.fileName has not been decoded as UTF-8, AFAICT.

dweinstein · 2016-10-03T19:25:01Z

Here's a reduced testcase:

± unzip -l test.zip
Archive:  test.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  10-03-2016 15:16   ç/
        6  10-03-2016 15:16   ç/hello
---------                     -------
        6                     2 files
± node testcase.js
├º/
├º/hello
Error: not found

Testcase code:

'use strict'

const fromFd = require('yauzl').fromFd
const once = require('once')
const fs = require('fs')

// ± unzip -l test.zip
// Archive:  test.zip
//   Length      Date    Time    Name
// ---------  ---------- -----   ----
//         0  10-03-2016 15:16   ç/
//         6  10-03-2016 15:16   ç/hello
// ---------                     -------
//         6                     2 files
//
function issue42 (fd, cb) {
  cb = once(cb)
  fromFd(fd, (err, zip) => {
    if (err) return cb(err)
    zip.on('entry', function onentry (entry) {
      if ((/ç\/hello/).test(entry.fileName)) {
        console.log(entry.fileName)
        cb(null, entry.fileName) // error-first callback: the name is the result, not the error
      }
      console.log(entry.fileName)
    })
    zip.on('end', () => {
      if (!cb.called) {
        cb(new Error('not found'))
      }
    })
  })
}

const fd = fs.openSync(__dirname + '/test.zip', 'r')

issue42(fd, (err, res) => {
  console.log(err, res)
})

test zip:

test.zip

thejoshwolfe · 2016-10-05T02:33:09Z

Interesting bug report. The behavior you're seeing from Info-Zip is actually non-standard behavior. yauzl is behaving "correctly" with respect to the zipfile specification.

There are multiple ways for a zipfile to indicate that the filenames are encoded in utf-8, and your zipfile does none of them. According to the spec, if no charset is specified, then cp437 is to be used, which is what yauzl is doing.
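To make the mismatch concrete, here is a minimal sketch of what happens when UTF-8 bytes are decoded as cp437. The table is deliberately partial (only the four high bytes that show up in this thread) and `decodeCp437Sample` is a hypothetical helper, not yauzl's decoder; a real cp437 decoder needs all 128 high-byte entries.

```javascript
'use strict'

// Partial cp437 table: only the high bytes seen in this thread.
// This is an illustration, not a complete code page.
const CP437_SAMPLE = {
  0xc3: '\u251c', // ├
  0xa7: '\u00ba', // º
  0xce: '\u256c', // ╬
  0xb2: '\u2593' // ▓
}

// Hypothetical helper: decode bytes one at a time, as cp437 does.
function decodeCp437Sample (bytes) {
  let out = ''
  for (const b of bytes) {
    if (b < 0x80) {
      out += String.fromCharCode(b) // ASCII range is shared
    } else if (b in CP437_SAMPLE) {
      out += CP437_SAMPLE[b]
    } else {
      throw new Error('byte not in sample table: 0x' + b.toString(16))
    }
  }
  return out
}

// 'ç' (U+00E7) is the two bytes c3 a7 in UTF-8.
const rawName = Buffer.from('\u00e7/hello', 'utf8')
console.log(decodeCp437Sample(rawName)) // each byte becomes its own character
console.log(rawName.toString('utf8')) // the two bytes become one character
```

The same reasoning accounts for the first report: 'β' is ce b2 in UTF-8, and those two bytes read as cp437 are '╬▓'.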

I'm not sure why Info-Zip's unzip is making an assumption about the filename being UTF-8. I've read the man page and even spent some time searching the source for the reason for that behavior. The closest I came is an excerpt from the zip man page, which may or may not be relevant:

Though the zip standard requires storing paths in an archive using a specific character set, in practice zips have stored paths in archives in whatever the local character set is.

So the question remains, what should yauzl do in this situation? Should the spec be considered correct, or should "in practice" behavior of popular tools be considered correct? It's a tough call, but I'm leaning toward the spec.

If you'd like to fix your zipfile, try setting general purpose bit 11 in all the entries. That is what yazl does to indicate the filename is to be decoded using UTF-8. If you're creating the zipfile at a higher level than that, then I suggest using a different library/utility for creating zipfiles, because the one you're using is non-conformant. If you didn't make the zipfile at all, but you got it from a user, then I suggest you forward this paragraph to your user.
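For reference, a minimal sketch of testing that bit. Bit 11 corresponds to the mask 0x800; `hasUtf8Flag` is a hypothetical helper name, and the sketch assumes the entry's flags are available as a plain integer (yauzl entries expose a `generalPurposeBitFlag` field, as far as I know).

```javascript
'use strict'

// General purpose bit 11: names and comments are UTF-8 encoded.
const UTF8_FLAG = 1 << 11 // 0x800

// Hypothetical helper: true when the UTF-8 flag is set in the given flags.
function hasUtf8Flag (generalPurposeBitFlag) {
  return (generalPurposeBitFlag & UTF8_FLAG) !== 0
}

console.log(hasUtf8Flag(0x0000)) // false: no charset declared
console.log(hasUtf8Flag(0x0800)) // true: UTF-8 declared
```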

I haven't seen general purpose bit 11 mishandled like this in any existing zipfile utility I've tested this with. I can't say for sure, but I believe I've tested this issue with Info-Zip's zip, Windows Compressed Folder, Mac's Archive Utility, and 7-Zip. I'm not as familiar with Java's ZipFile class, Python's zipfile module, or WinRAR.

So I don't know how this zipfile came to exist with the filename encoding messed up, but I really don't think I should follow in Info-Zip's nonstandard footsteps on this matter. Following the spec is one of yauzl's design principles, and cp437 support is a feature.

dweinstein · 2016-10-11T01:57:09Z

FWIW the zip was created with zip on a mac.

dweinstein · 2016-10-11T03:18:34Z

@thejoshwolfe would you consider using something like https://gist.github.com/dweinstein/3125bed0a478e2b0acfccfae91c90fd5#file-guess-encoding-js, which is a port to JavaScript of libzip's _zip_guess_encoding? So far testing has been OK.

I have a branch you can try out here https://github.com/dweinstein/yauzl/tree/guess-encoding -- all tests are passing for me at the moment.

thejoshwolfe · 2016-10-11T03:43:52Z

I would consider adding an option to the open() API (and related APIs) that forces all strings in a zipfile to be interpreted as UTF-8. I wouldn't want to do any guessing in this library, because that seems too high-level, error-prone, non-standard, etc., but if users simply tell yauzl to deviate from the spec in this regard, that seems like a compromise everyone can be happy with.

Realistically, I'd bet that it's safe to always pass in that flag, if it existed. The only time it would cause a problem is if a zipfile was created with cp437 and actually used the non-ascii part of cp437. It could happen, but it'd probably be a very old zipfile if it ever did.

Does that sound like a viable solution to your problem?

thejoshwolfe · 2016-10-11T12:03:24Z

Better proposal: add an option to open() that leaves all strings undecoded as Buffer objects instead of strings. Then you can use any kind of encoding guesser or assume UTF-8 as you wish. I think this is the right solution to this issue.
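A rough sketch of how a caller might use such an option. It assumes the option is named `decodeStrings` (the name that appears in the title of the later issue #113 referenced in this thread) and that `entry.fileName` is then a Buffer. `listUtf8Names` and `decodeNameAsUtf8` are hypothetical helpers, and the yauzl module is passed in as a parameter so the snippet stays self-contained.

```javascript
'use strict'

// Hypothetical helper: the caller decides the charset, here always UTF-8.
function decodeNameAsUtf8 (rawName) {
  return rawName.toString('utf8')
}

// Hypothetical helper: collect all entry names, decoding them ourselves.
// Assumes `yauzl.open(path, options, cb)` and an option `decodeStrings: false`
// that leaves entry.fileName as a Buffer.
function listUtf8Names (yauzl, zipPath, cb) {
  yauzl.open(zipPath, { decodeStrings: false }, (err, zip) => {
    if (err) return cb(err)
    const names = []
    zip.on('entry', (entry) => {
      names.push(decodeNameAsUtf8(entry.fileName))
    })
    zip.on('end', () => cb(null, names))
  })
}
```

Usage would look like `listUtf8Names(require('yauzl'), 'test.zip', (err, names) => console.log(err, names))`. Decoding unconditionally as UTF-8 is the simplest policy; an encoding guesser could replace `decodeNameAsUtf8` without touching the rest.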

dweinstein · 2016-10-11T13:56:02Z

That sounds pretty reasonable. Having the buffer along with the flags surfaced will definitely allow another library to do the guessing...
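As an illustration of what guessing in a separate layer could look like, here is a minimal sketch that classifies a raw name Buffer. It is not a port of libzip's _zip_guess_encoding and not the gist linked above; it is a simpler round-trip check, and `guessNameEncoding` is a hypothetical name.

```javascript
'use strict'

// Hypothetical helper: classify a raw filename buffer.
// Returns 'ascii', 'utf8', or 'unknown' (e.g. cp437 or another legacy charset).
function guessNameEncoding (rawName) {
  // Pure ASCII decodes identically under cp437 and UTF-8.
  if (rawName.every((b) => b < 0x80)) return 'ascii'

  // Node replaces invalid UTF-8 sequences with U+FFFD when decoding,
  // so an invalid input will not survive a decode/encode round trip.
  const roundTrip = Buffer.from(rawName.toString('utf8'), 'utf8')
  if (roundTrip.equals(rawName)) return 'utf8'

  return 'unknown'
}

console.log(guessNameEncoding(Buffer.from('hello')))
console.log(guessNameEncoding(Buffer.from('\u00e7/hello', 'utf8')))
console.log(guessNameEncoding(Buffer.from([0x87, 0x2f]))) // 0x87 is 'ç' in cp437
```

Any heuristic like this can misfire: a cp437 name whose high bytes happen to form valid UTF-8 would be reported as 'utf8', which is the error-prone case mentioned earlier in the thread.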

thejoshwolfe · 2016-10-25T03:32:34Z

Published in version 2.7.0.

thejoshwolfe added the bug label Oct 5, 2016

thejoshwolfe closed this as completed Oct 5, 2016

thejoshwolfe changed the title ~~fileEntry name encoding issue~~ option to emit raw string buffers instead of decoded strings Oct 11, 2016

thejoshwolfe reopened this Oct 11, 2016

thejoshwolfe added enhancement and removed bug labels Oct 11, 2016

thejoshwolfe closed this as completed in cc7455a Oct 25, 2016

mykola-mokhnach mentioned this issue Jul 4, 2020

Failed to unhide archs in executable file:///private/var/installd/Library/Caches/com.apple.mobile.installd.staging/temp appium/appium#14100

Closed

fpsqdb mentioned this issue Feb 18, 2024

Get the correct fileName from extra filed when decodeStrings is false #113

Closed
