Unzipping is too slow #23

Open
aegoroff opened this issue Nov 6, 2018 · 12 comments
aegoroff commented Nov 6, 2018

When I tried to unpack a big file (about 3 GiB as xz and about 18 GiB unpacked), the process was too slow: only 3 GiB of the 18 GiB were unpacked in about 40 minutes on my machine. The same file was unpacked in about 5 minutes using the 7-Zip tool.

@aegoroff aegoroff changed the title Unzipping is to slow Unzipping is too slow Nov 6, 2018
@ulikunitz ulikunitz self-assigned this Nov 8, 2018
@ulikunitz
Owner

Thank you for reporting. This is expected, and I have the following language in README.md:

At this time the package cannot compete with the xz tool regarding compression speed and size.

I haven't found the time so far to work on code optimization. On the plus side, there is a lot of potential for improving the situation. Unfortunately, I cannot promise when I will work on it.

@ulikunitz
Owner

There is work ahead. I left the issue open.

@alecthomas

I just ran into slow decompression and the (partial) solution is to wrap your reader in bufio.NewReader(). It turns out this library uses ReadByte() a great deal and on unbuffered input this is incredibly slow.

I say "partial" as unfortunately this fails on some inputs with

writeMatch: distance out of range

Very weird that it fails when buffered but works when unbuffered.
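The cost of byte-at-a-time reads described above can be illustrated with the standard library alone. The following is a minimal sketch, not code from the xz package: `countingReader`, `readBytewise`, and `underlyingReads` are hypothetical names, and an in-memory source stands in for a file. On a real `*os.File`, each call that reaches the source would typically be a separate read system call.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
)

// countingReader counts how often Read is called on the wrapped reader.
// It is a hypothetical helper for this illustration.
type countingReader struct {
	r     io.Reader
	calls int
}

func (c *countingReader) Read(p []byte) (int, error) {
	c.calls++
	return c.r.Read(p)
}

// readBytewise consumes r one byte at a time until an error (usually
// io.EOF) occurs, mimicking a decoder that relies on single-byte reads.
func readBytewise(r io.Reader) int {
	var b [1]byte
	n := 0
	for {
		m, err := r.Read(b[:])
		n += m
		if err != nil {
			return n
		}
	}
}

// underlyingReads reports how many Read calls reach the source when
// size bytes are consumed bytewise, with or without a bufio.Reader.
func underlyingReads(size int, buffered bool) int {
	src := &countingReader{r: bytes.NewReader(make([]byte, size))}
	var r io.Reader = src
	if buffered {
		r = bufio.NewReader(src)
	}
	readBytewise(r)
	return src.calls
}

func main() {
	const size = 1 << 20 // 1 MiB of input
	fmt.Println("reads without bufio:", underlyingReads(size, false))
	fmt.Println("reads with bufio:   ", underlyingReads(size, true))
}
```

Without buffering, the number of reads that hit the source grows with the number of bytes; with `bufio.NewReader`, it grows with the number of buffer refills instead.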

@ulikunitz
Owner

Yes, the library doesn't implement its own buffering and because it uses ReadByte it benefits from buffered readers. I should have documented it.

The rationale at the time was that I wanted to use a buffered reader only if there is a need for it. For instance, I didn't want to use a buffered reader for a bytes.Buffer.

A buffered reader shouldn't make a difference for the reading process. The gxz tool is using a buffered reader and I have run extensive tests for it.

Can you provide the file that you want to decompress?

alecthomas commented Feb 19, 2021

Sure, I was decompressing the Zig tarballs from here.

@alecthomas

Fixed!

@ulikunitz
Owner

I have now downloaded all 0.8.0 files and decompressed them with the gxz tool, which uses bufio.Reader, and all of them decompressed without problems.

Please provide:

  • name of the actual file generating issues
  • version of the xz module
  • the code you are using to decompress the file
  • output of go env

@alecthomas

Oh you're asking for the failing one, sorry, that wasn't clear - I thought you were asking for one of the slow ones.

@alecthomas

This is the one that fails. Interestingly it also fails with github.com/xi2/xz

@ulikunitz
Owner

ulikunitz commented Feb 20, 2021

Hi, this is a deb file, which is an ar archive. You must do the following:

$ ar xv bzip2_1.0.6-9.2_deb10u1_amd64.deb 
x - debian-binary
x - control.tar.xz
x - data.tar.xz

The two xz files can easily be uncompressed and generate no issues for me. The debian-binary is a plain-text file. Information about the deb format can be found in the manual page for deb.

anatol added a commit to anatol/booster that referenced this issue Mar 17, 2021
There are 2 xz golang libraries:

* https://github.com/xi2/xz fast but provides Reader functionality only, currently used to unpack modules
* https://github.com/ulikunitz/xz has Writer() but Reader path is slower ulikunitz/xz#23
  use it for image compression

Closes #42
chrisnovakovic added a commit to chrisnovakovic/arcat that referenced this issue Nov 30, 2022
Tatskaari added a commit to please-build/arcat that referenced this issue Nov 30, 2022
The performance of `github.com/ulikunitz/xz` when decompressing xz data
is a known limitation; see ulikunitz/xz#23.
`github.com/xi2/xz` is significantly faster for xz decompression; use it
in place of `github.com/ulikunitz/xz` in the `unzip` package.
`github.com/xi2/xz` doesn't implement xz compression, so the `tar`
package must continue to use `github.com/ulikunitz/xz`.

Performance evaluation on a sample 576MB xz-compressed tarball (the
binary distribution of Clang for Ubuntu 18.04) with a dictionary size
of 64MB (which corresponds to compression preset level 9) and a
resulting ~13% compression ratio:

```
bash-4.4$ ls -lh $SRCS
-rw-r--r-- 2 csn users 576M Nov 19 17:54 clang+llvm-14.0.0-x86_64-linux-gnu-ubuntu-18.04.tar.xz

bash-4.4$ xz -lvv $SRCS
clang+llvm-14.0.0-x86_64-linux-gnu-ubuntu-18.04.tar.xz (1/1)
  Streams:            1
  Blocks:             1
  Compressed size:    575.8 MiB (603,776,352 B)
  Uncompressed size:  4,408.2 MiB (4,622,376,960 B)
  Ratio:              0.131
  Check:              CRC64
  Stream padding:     0 B
  Streams:
    Stream    Blocks      CompOffset    UncompOffset        CompSize      UncompSize  Ratio  Check      Padding
         1         1               0               0     603,776,352   4,622,376,960  0.131  CRC64            0
  Blocks:
    Stream     Block      CompOffset    UncompOffset       TotalSize      UncompSize  Ratio  Check      CheckVal          Header  Flags        CompSize    MemUsage  Filters
         1         1              12               0     603,776,312   4,622,376,960  0.131  CRC64      b4d869416c7f940f      12  --        603,776,291      65 MiB  --lzma2=dict=64MiB
  Memory needed:      65 MiB
  Sizes in headers:   No
  Minimum XZ Utils version: 5.0.0
```

With GNU tar 1.29 and liblzma 5.2.2 (a useful baseline):

```
bash-4.4$ time tar xf $SRCS

real    0m40.250s
user    0m36.544s
sys     0m6.847s
```

arcat with `github.com/ulikunitz/xz` handling xz decompression:

```
bash-4.4$ time $TOOLS_ARCAT x $SRCS

real    12m6.254s
user    4m6.769s
sys     8m4.628s
```

arcat with `github.com/xi2/xz` handling xz decompression:

```
bash-4.4$ time $TOOLS_ARCAT x $SRCS

real    0m55.643s
user    0m50.877s
sys     0m2.275s
```

Co-authored-by: jpoole <[email protected]>

mark-summerfield commented Aug 19, 2023

I used xz to unpack Python-3.11.4.xz. Using Python 3.10 it took 4sec; using Go it took 1m55sec. So I do think Go xz has a speed issue.

I just tried github.com/therootcompany/xz and it took 5sec.

ghost commented Aug 19, 2023

I posted this two years ago, but it got deleted. Here it is again; it should help with the speed:

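package test

import (
   "archive/tar"
   "bufio"
   "io"
   "os"
   "path"
   "testing"

   "github.com/ulikunitz/xz"
)

const cargo = "cargo-1.54.0-x86_64-pc-windows-gnu.tar.xz"

// extractFile writes the current tar entry to disk and closes the file
// before returning, so open handles do not pile up across entries.
func extractFile(tr *tar.Reader, name string) error {
   // 0o755 gives usable directory permissions (os.ModeDir alone sets none).
   if err := os.MkdirAll(path.Dir(name), 0o755); err != nil {
      return err
   }
   f, err := os.Create(name)
   if err != nil {
      return err
   }
   if _, err := io.Copy(f, tr); err != nil {
      f.Close()
      return err
   }
   return f.Close()
}

// readFrom unpacks every regular file of the tar stream r into the
// current directory. Entry names are used as-is, so run it only on
// archives you trust.
func readFrom(r io.Reader) error {
   tr := tar.NewReader(r)
   for {
      n, err := tr.Next()
      if err == io.EOF {
         return nil
      } else if err != nil {
         return err
      } else if n.Typeflag != tar.TypeReg {
         continue
      }
      if err := extractFile(tr, n.Name); err != nil {
         return err
      }
   }
}

func TestUlikunitz(t *testing.T) {
   f, err := os.Open(cargo)
   if err != nil {
      t.Fatal(err)
   }
   defer f.Close()
   // The bufio.Reader is the point here: it keeps the decoder's
   // single-byte reads from hitting the file directly.
   r, err := xz.NewReader(bufio.NewReader(f))
   if err != nil {
      t.Fatal(err)
   }
   if err := readFrom(r); err != nil {
      t.Fatal(err)
   }
}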
