[performance] beating git in git verify-pack
#47
-
Measurements with the lookup-based algorithm, constrained to three threads for fairness. Lookup is clearly slower, but thanks to the LRU cache not nearly as bad as it could be, given the enormous amount of additional work it does.
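As a side note for readers, here is a minimal sketch of what such an LRU cache of decompressed delta bases (keyed by pack offset) could look like; the type and its API are made up for illustration and are not gitoxide's actual cache.

```rust
use std::collections::{HashMap, VecDeque};

// Toy LRU cache for decompressed delta bases, keyed by pack offset.
struct BaseCache {
    capacity: usize,
    map: HashMap<u64, Vec<u8>>,
    order: VecDeque<u64>, // front = least recently used
}

impl BaseCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    fn get(&mut self, offset: u64) -> Option<&Vec<u8>> {
        if self.map.contains_key(&offset) {
            // Mark as most recently used (O(n), fine for a toy).
            self.order.retain(|&o| o != offset);
            self.order.push_back(offset);
        }
        self.map.get(&offset)
    }

    fn put(&mut self, offset: u64, data: Vec<u8>) {
        if self.map.insert(offset, data).is_none() {
            self.order.push_back(offset);
            if self.order.len() > self.capacity {
                if let Some(evicted) = self.order.pop_front() {
                    self.map.remove(&evicted);
                }
            }
        }
    }
}

fn main() {
    let mut cache = BaseCache::new(2);
    cache.put(100, vec![0; 1024]); // pretend: decompressed base at pack offset 100
    cache.put(200, vec![0; 2048]);
    cache.get(100);                // touch 100, so 200 becomes the eviction candidate
    cache.put(300, vec![0; 512]);  // evicts 200
    assert!(cache.get(200).is_none());
    assert!(cache.get(100).is_some() && cache.get(300).is_some());
}
```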
-
Note that the statistics in the image above show a delta chain length of at most 1, as we have a 'perfect' cache thanks to the tree-like delta traversal. We are now faster than git for the very first time, a win that didn't come easily and certainly isn't perfect: git uses fewer system calls and doesn't seem to rely visibly on memory-mapping packs for random access, which appears to be why it uses far less memory. However, our implementation scales perfectly and isn't artificially limited to three threads (unless this is somewhat covertly overridden). Since implementing new variants of how to do this kind of work is fairly straightforward, this might not even be the end of it.
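A rough sketch of the tree-like traversal idea, with made-up types and closures standing in for the real decompression and delta application (gitoxide's actual implementation differs): deltas are grouped by the offset of their base, and each base is resolved before its deltas, so the data a delta needs is always already in memory.

```rust
use std::collections::HashMap;

// An entry as seen in the pack: either a base object, or a delta referring to one.
struct Entry {
    offset: u64,
    base_offset: Option<u64>, // Some(..) for deltas, None for base objects
}

fn traverse(
    entries: &[Entry],
    decompress_base: impl Fn(u64) -> Vec<u8>,
    apply_delta: impl Fn(&[u8], u64) -> Vec<u8>,
    mut visit: impl FnMut(u64, &[u8]),
) {
    // Group deltas by the base they need.
    let mut children: HashMap<u64, Vec<u64>> = HashMap::new();
    let mut roots = Vec::new();
    for e in entries {
        match e.base_offset {
            Some(base) => children.entry(base).or_default().push(e.offset),
            None => roots.push(e.offset),
        }
    }
    // Walk from each base towards its deltas: whenever a delta is applied,
    // its base is already resolved and in memory, i.e. the 'perfect' cache.
    let mut stack: Vec<(u64, Vec<u8>)> = roots
        .into_iter()
        .map(|ofs| (ofs, decompress_base(ofs)))
        .collect();
    while let Some((ofs, data)) = stack.pop() {
        visit(ofs, &data);
        for &child in children.get(&ofs).into_iter().flatten() {
            stack.push((child, apply_delta(&data, child)));
        }
    }
}

fn main() {
    // Two bases (offsets 1, 2) and a small delta chain on top of base 1.
    let entries = [
        Entry { offset: 1, base_offset: None },
        Entry { offset: 2, base_offset: None },
        Entry { offset: 3, base_offset: Some(1) },
        Entry { offset: 4, base_offset: Some(3) },
    ];
    traverse(
        &entries,
        |ofs| format!("base@{ofs}").into_bytes(),
        |base, ofs| [base, format!("+delta@{ofs}").as_bytes()].concat(),
        |ofs, data| println!("{ofs}: {}", String::from_utf8_lossy(data)),
    );
}
```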
-
@joshtriplett If you feel like setting a new speed record, you could try running the latest version against the kernel pack (and publish your results here :) ). This time, and with three threads, the Rust version seems to be equal or a little faster, gaining more and more of an advantage the more cores you add. Git indeed does not scale well and gets slower beyond three cores. (My current plan is to eventually post a link to this issue on reddit.)
-
A few more tests were run; my expectation would nonetheless be that it will scale well with more cores.
-
Interestingly, with more threads than are logically available on the machine, there is still some gain to be had: 6 threads on a 4-core machine seem to yield the best results, allowing my machine to crack the 1GB mark!
-
According to this… …git-oxide is now 27.6/19.5 = 1.41x faster than git in this particular task with three threads, and 27.6/16.0 = 1.72x faster with 4 threads. And it could scale with more cores. I wonder what the difference in algorithm is that causes git to use less memory (or not rely on memory maps) but fail to scale.
-
Just tested. I made sure the packfile and the index were in cache first; cached reads of them off disk take about 0.23s and 0.04s, respectively. Is git's
Note on git-oxide: its help says it can take either a .pack file or a .idx file, but when run on a .pack it doesn't seem to do anything visible except read the pack. To get anything interesting I had to run it on the index. I also found that with 96 threads, the less-memory algorithm was much faster; it uses more user time, but finishes in half the real time, and pulls off more than 10GB/s. Results with git-oxide:
You've definitely got a scaling problem in the default less-time algorithm, because with 48 threads instead of 96, it finishes faster:
By contrast, the less-memory algorithm ran just barely slower with 48 threads; that suggests a scaling problem as well, but not quite as bad of one:
Results at 24 threads:
12 threads:
And 6 threads:
Side note: is there any reason you can't compute the sha1s of the pack and index in parallel, and in parallel with the subsequent checking? Also, the
A bunch of threads with 0-2 entries looks like a problematic bottleneck. Hope that helps.
-
Thanks so much for taking the time! Very informative, and very valuable information…! Right now I wouldn't know if this can get any faster, or if at some point the memory subsystem of the machine is simply overloaded when too many threads compete. It accesses the memory-mapped pack, decompresses the data, and computes SHA1 and CRC32 on all entries, and without debugging on such a machine I doubt any further optimizations can be done reliably. My answers:
At least that's what I think, more or less. The idea is to build a tree of deltas to allow traversing the chain from base to delta, and not from delta to base. This allows for a cache implementation that is perfect, i.e. a delta can be applied to its base from cache all the time.
Without having studied git's code in detail, it looks like it decompresses objects while indexing them, storing some of them in memory. To me it seems it doesn't need an index for this to work.
Yes, related to the above. A valid pack always has an index and I think it's beneficial to make use of it. The reason git doesn't do it to that extent might be that
Indeed, the parallelisation is done using a super simple fan-in-fan-out map-reduce using nothing more than std threads and channels. In a way, this already is work stealing, but it can only steal entire chunks of work. Right now the chunk size is rather big, and the smaller it is, the more one pays for the channel synchronisation. Probably some tuning is possible there. Maybe it's worth trying to have chunks as small as 10 objects; for all I know, having one object per chunk (i.e. no chunking) visibly slows it down due to sync overhead.
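For readers unfamiliar with the pattern, here is a minimal sketch of such a fan-out/fan-in setup using nothing but std threads and channels; the per-chunk work function, the chunk size, and the numbers are placeholders, not gitoxide's actual code.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Placeholder for the real per-chunk work (decompress, apply deltas, SHA1, CRC32, ...).
fn process(chunk: Vec<u64>) -> usize {
    chunk.len()
}

fn main() {
    let num_workers = 4;
    let chunk_size = 1000; // the knob discussed above: bigger chunks mean less sync overhead
    let work: Vec<u64> = (0..1_000_000).collect();

    let (work_tx, work_rx) = mpsc::channel::<Vec<u64>>();
    let work_rx = Arc::new(Mutex::new(work_rx)); // share one receiver among the workers
    let (result_tx, result_rx) = mpsc::channel::<usize>();

    let mut workers = Vec::new();
    for _ in 0..num_workers {
        let work_rx = Arc::clone(&work_rx);
        let result_tx = result_tx.clone();
        workers.push(thread::spawn(move || loop {
            let chunk = {
                // Lock only long enough to receive the next chunk; every recv()
                // is a synchronization point, which is why tiny chunks are costly.
                let rx = work_rx.lock().unwrap();
                match rx.recv() {
                    Ok(chunk) => chunk,
                    Err(_) => break, // channel closed: no more work
                }
            };
            result_tx.send(process(chunk)).ok();
        }));
    }
    drop(result_tx); // keep only the workers' clones alive

    // Fan out: a single producer chunks the work.
    for chunk in work.chunks(chunk_size) {
        work_tx.send(chunk.to_vec()).expect("workers alive");
    }
    drop(work_tx); // closing the channel lets the workers terminate

    // Fan in: reduce results as they arrive.
    let total: usize = result_rx.iter().sum();
    for w in workers {
        w.join().unwrap();
    }
    println!("processed {total} objects");
}
```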
-
Regarding the bottleneck: maybe using a different channel implementation can speed this up; right now plain std channels are in use.
-
I have played around with the relevant variables a little and came up with a better (read: higher) chunk size for the 'less-time' algorithm, while opening the code up a little for smarter heuristics as time passes. Ideally, it finds chunk sizes that keep threads busy for no more than around a second; that way starvation won't really be noticeable, while keeping the synchronization cost low.
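In the spirit of that heuristic, a toy version of such a computation could look like the following; the numbers and the function are purely illustrative, not the code in gitoxide.

```rust
/// Pick a chunk size that keeps a thread busy for roughly a second at most,
/// while still producing enough chunks that no thread starves near the end.
/// `items_per_sec_per_thread` is an assumed or measured throughput, not a constant.
fn chunk_size(num_items: usize, num_threads: usize, items_per_sec_per_thread: usize) -> usize {
    // Enough chunks for reasonable load balancing ...
    let for_balance = num_items / (num_threads.max(1) * 10);
    // ... but no chunk should take much longer than ~1s, and chunks shouldn't be
    // so tiny that channel synchronization dominates.
    for_balance.min(items_per_sec_per_thread).max(64)
}

fn main() {
    // e.g. 7.5 million objects, 8 threads, ~150k objects/s per thread (made-up numbers)
    println!("chunk size: {}", chunk_size(7_500_000, 8, 150_000));
}
```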
-
Closing this issue.
-
It could be done, certainly if optimizing for speed is the goal. However, I thought it isn't necessary to start checking objects if the SHA1 of the pack already turns out to be wrong. Maybe there could be a flag to simply skip the SHA1 checks and proceed with the per-object checks. Something like that could certainly be added if it makes things a little faster overall.
-
On Wed, Jul 22, 2020 at 10:00:09PM -0700, Sebastian Thiel wrote:
> @joshtriplett
> > Side note: is there any reason you can't compute the sha1s of the pack and index in parallel, and in parallel with the subsequent checking?
> It could be done, certainly if optimizing for speed is the goal. However, I thought it's not necessary to start checking if the SHA1 of the pack already turns out to be wrong.
> Maybe there could be a flag to simply skip the Sha1 checks and proceed with the per-object checks.
> Something like that could certainly be added though if it should get a little faster overall.
I'm less concerned about the performance of verify-pack in particular,
and more thinking that in general git-oxide should try to do as much as
possible in parallel.
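To make the suggestion concrete, here is a sketch of checksumming the .pack and the .idx on separate threads while other verification work could proceed. The paths are placeholders and the RustCrypto `sha1` crate is assumed for the digest; this is not gitoxide's code.

```rust
use sha1::{Digest, Sha1};
use std::{fs, thread};

// Hash a whole file; a real implementation would stream it in chunks instead.
fn sha1_of(path: &str) -> std::io::Result<Vec<u8>> {
    let bytes = fs::read(path)?;
    Ok(Sha1::digest(&bytes).to_vec())
}

fn main() -> std::io::Result<()> {
    let (pack_sha, idx_sha) = thread::scope(|s| {
        let pack = s.spawn(|| sha1_of("pack-1234.pack")); // placeholder path
        let idx = s.spawn(|| sha1_of("pack-1234.idx"));   // placeholder path
        // ... per-object verification could start here, overlapping with both hashes ...
        (pack.join().unwrap(), idx.join().unwrap())
    });
    println!("pack: {:02x?}", pack_sha?);
    println!("idx:  {:02x?}", idx_sha?);
    Ok(())
}
```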
-
With further optimizations, I now get down to 90s for verifying the kernel pack, compared to three minutes for git. The same data structure is used for receiving/indexing packs too, which allows cloning faster than git. The above time was achieved with four threads, as opposed to the three that git uses by default. For comparison, here is the same task with git.
-
@Byron Nice! I would love to see a faster
-
This is incredibly impressive scalability now.
You may want to rate-limit the progress updates.
Watching With the
Pack indexing:
You're right that it spends most of its time limited by decompression, then spends a few seconds using many CPUs, then finishes. Regarding decompression performance, try replacing the current zlib implementation (see the quick benchmarks in my next comment). Also, as far as I can tell, don't pack files have many separately deflate-compressed streams, rather than one giant deflate-compressed stream? Couldn't you deflate multiple pieces in parallel?
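On the last question: each pack entry is indeed its own zlib stream, so once the entry offsets are known (e.g. from the .idx), independent entries can be inflated on separate threads. Below is a self-contained sketch using flate2 (which can be backed by zlib-ng through a Cargo feature); the test data is generated on the spot and stands in for pack entries.

```rust
use flate2::bufread::{ZlibDecoder, ZlibEncoder};
use flate2::Compression;
use std::io::Read;
use std::thread;

// Produce a few small, independently compressed zlib streams as stand-ins for pack entries.
fn make_test_streams() -> Vec<Vec<u8>> {
    (0..4)
        .map(|i| {
            let data = vec![i as u8; 8192];
            let mut out = Vec::new();
            ZlibEncoder::new(data.as_slice(), Compression::default())
                .read_to_end(&mut out)
                .unwrap();
            out
        })
        .collect()
}

fn main() {
    let streams = make_test_streams();

    // Inflate every stream on its own thread; real code would hand out chunks instead.
    let outputs: Vec<Vec<u8>> = thread::scope(|s| {
        let handles: Vec<_> = streams
            .iter()
            .map(|compressed| {
                s.spawn(move || {
                    let mut out = Vec::new();
                    ZlibDecoder::new(compressed.as_slice())
                        .read_to_end(&mut out)
                        .expect("valid zlib stream");
                    out
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    println!(
        "inflated {} streams, {} bytes total",
        outputs.len(),
        outputs.iter().map(|o| o.len()).sum::<usize>()
    );
}
```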
-
**Quick informal benchmarks of DEFLATE implementations**

For test data, I tarred up the contents of a local directory (the /tmp/usr-bin.tar used below).

```rust
use std::io::Read;
use std::time::Instant;
use flate2::Compression;
use flate2::bufread::{GzDecoder, GzEncoder};

fn main() -> anyhow::Result<()> {
    let mut dest = Vec::with_capacity(1024*1024*1024);

    let uncompressed_data = std::fs::read("/tmp/usr-bin.tar")?;
    let time = Instant::now();
    let mut gz = GzEncoder::new(uncompressed_data.as_slice(), Compression::default());
    gz.read_to_end(&mut dest)?;
    let elapsed = time.elapsed();
    eprintln!("Compressed {} bytes to {} bytes in {:?}", uncompressed_data.len(), dest.len(), elapsed);

    dest.clear();

    let compressed_data = std::fs::read("/tmp/usr-bin.tar.gz")?;
    let time = Instant::now();
    let mut gz = GzDecoder::new(compressed_data.as_slice());
    gz.read_to_end(&mut dest)?;
    let elapsed = time.elapsed();
    eprintln!("Decompressed {} bytes to {} bytes in {:?}", compressed_data.len(), dest.len(), elapsed);
    Ok(())
}
```

For each test, I compiled with a different backend:

miniz (default pure-Rust implementation)
Compressed 462356480 bytes to 161332179 bytes in 25.073941226s

miniz-sys (
-
Whoop whoop, this feels like being knighted, and all the work totally paid off. Iteration after iteration after iteration, and it really was just Rust pushing me into doing "what's right", which also happens to be fast as a freebie.
I will try; is there a platform you would recommend for that? Thus far I was unable to obtain the tiny kernel pack of only 1.4GB, as the clones would always abort midway given my incredible 6kb/s download speed. Alternatively, if you have an easy way to share that file, I would be happy to give downloading that a try.
To me it looks like it's still a tiny fraction, but I might misread what is shown here. My disposition for this one is to leave it, as I am happy with the small overhead, and that turning it off works so well. It's all monomorphic and the branch predictor certainly has an easy time seeing that this thing is always off or always on :D.
This all comes down to the chunk size, which right now is static and computed using a fantastic algorithm that probably should be done by someone who knows math. As the total number of chunks can be recomputed, I could imagine knowing that for the last N chunks (N = num cores) we have to split them smaller to feed all threads. If that is implemented, I could imagine this ~6.2 number going down to ~5.2 even. Definitely something to look into. Maybe as a first simple option, there could be a 'chunk-size' flag to override the batch size and tune it based on experience. Maybe that way it's already possible to squeeze out another half a second or so. It's odd though that in the first few seconds there isn't enough work available for all threads to start right away, which might indicate that the single-threaded producer is not fast enough.
It's great this one also got faster, even though the core algorithm is definitely unchanged. Probably it gets started more quickly now because the pack verification is done in parallel with everything else - thanks again for giving me the nudge in that direction!
Yes, but only if you know where these streams are. Right now the only way to find out (when only looking at the pack file) is to set the decompressor past the header of a pack entry and let it decompress till done; this marks the beginning of the next entry. If it were somehow possible to find entries more quickly, an index could be built much faster. Index traversal is already limited only by sequential/burst IO speeds, as it skips through the pack to read entry headers only. It can do that because it knows where each entry begins (and it's faster than seeking :D). Since you seem to be involved with zlib much more than I am, let me restate that finding the beginning or end of a zlib stream reliably without decompressing it would be a major thing for clone speeds of git. And while writing this, the quick zlib comparison came in: thanks so much! Switching to zlib-ng will be very impactful! Getting a better zlib decompressor/compressor is definitely high up on the list and I would be happy to allocate some time for it once you give the go. Ideally it's possible to avoid pulling in Flate2, but I'd do it if you think it's the right thing to do after all (as long as streams can be re-used thanks to a fast reset). Thanks again, it's great to enjoy your support, and you are having quite an impact on this young project, shaping it for the better!
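To make the "entry headers only" part concrete, here is a small parser for the v2 pack entry header, which encodes the object type and the decompressed size as a varint; the compressed size is nowhere to be found, which is exactly why locating the next entry without an index means inflating to the end of the current one. (Delta entries additionally carry their base offset or base id right after this header; that part is omitted, and the test bytes are hand-crafted.)

```rust
/// Parse the header of a pack (v2) entry at `offset`: returns (type, decompressed size,
/// offset of the first byte after the header). Error handling is omitted for brevity.
fn parse_entry_header(pack: &[u8], offset: usize) -> (u8, u64, usize) {
    let mut i = offset;
    let mut byte = pack[i];
    i += 1;
    let kind = (byte >> 4) & 0b111;          // object type (commit, tree, blob, delta, ...)
    let mut size = (byte & 0b1111) as u64;   // low 4 bits of the decompressed size
    let mut shift = 4;
    while byte & 0b1000_0000 != 0 {          // MSB set: another size byte follows
        byte = pack[i];
        i += 1;
        size |= ((byte & 0b0111_1111) as u64) << shift;
        shift += 7;
    }
    (kind, size, i)
}

fn main() {
    // Hand-crafted example: continuation bit set, type 1 (commit), size = 5 + (2 << 4) = 37.
    let bytes = [0b1001_0101u8, 0b0000_0010];
    assert_eq!(parse_entry_header(&bytes, 0), (1, 37, 2));
    println!("ok");
}
```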
-
You're doing incredible work, and I'm glad it's working out so well.
Sure: <redacted by @Byron> I'd suggest downloading with curl and making sure you can resume in case it gets aborted. Please let me know when you have them so I can remove them. I'll hang onto those same objects for future testing consistency. SHA256 checksums:
It adds 0.6s to a 10.6s build, which is a 5.7% increase. Not huge, but it's a non-trivial amount of measurement overhead. That said, I completely agree that you don't need to prioritize it. Long-term, there'd be value in improving it just a little bit, since people may want a simple progress indicator for long operations, without needing the full detail you currently capture; however, so many other things are higher priority.
Interesting! And yeah, I'm impressed with how good the progress data is. If you ever want to work on it, we should do some actual profiling on the 96-way to see where the overhead is. (It'd be easy to capture
I did find that algorithm when I was looking for the chunk size, though I didn't dig into the details. As a quick hack, I tried dropping the chunk size. What number would you expect to see per thread?
When I looked at the full
Makes perfect sense. Would it be possible, with some care, to use the index to figure out in advance which objects will be needed again and which ones won't? Could you compute a small DAG of objects you need for deltas (without storing the objects themselves), and use that to decide the order you process objects in? There's a related problem I'd be extremely interested in a solution to: could you reconstruct objects incrementally, and not reconstruct the entire object at once? For instance, if I needed the first 4k of a blob, could you reconstruct just that 4k, without reconstructing the entire object? (And then reconstruct the next 4k, and the next 4k, on request, as efficiently as possible but without keeping around the whole object?)
Ah, I see. So, anything other than indexing can use the index file to know where all the objects are and decompress them in parallel, but indexing itself has to decompress sequentially to find those offsets? Makes sense. It's unfortunate that the pack file format stores the decompressed size of each object, but not the compressed size.
Oooh, that's a fun idea. I don't think that it's possible to find the end of the stream in constant time, and I don't think pattern-matching bytes would be possible. But I think that with some care, it would be possible to rapidly skip through a DEFLATE-compressed stream without decompressing it, if you don't actually care about the data itself and you only want to find the end of it. You could skip many of the actual DEFLATE steps, and just decode the symbols from the Huffman tables and skip over match/distance repeats without constructing the decompressed data. See https://en.wikipedia.org/wiki/DEFLATE and https://tools.ietf.org/html/rfc1951 for details on the DEFLATE format. I filed zlib-ng/zlib-ng#714 requesting an implementation of this in zlib-ng.
As far as I can tell, zlib's reset should be efficient and not involve reallocating or zeroing any state. I do think you should go ahead and switch to something that's based on zlib-ng. EDIT: and using flate2 would give you the ability to pick between performance and pure Rust without having to implement that support yourself.
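For reference, a sketch of what reusing one zlib state across many small streams looks like with flate2's low-level types; the compressed test data is generated on the spot and merely stands in for pack entries.

```rust
use flate2::{Compress, Compression, Decompress, FlushCompress, FlushDecompress};

fn main() {
    // Generate a few small, independent zlib streams.
    let streams: Vec<Vec<u8>> = (0..3)
        .map(|i| {
            let mut deflate = Compress::new(Compression::default(), true);
            let mut out = Vec::with_capacity(256);
            deflate
                .compress_vec(&vec![i as u8; 100], &mut out, FlushCompress::Finish)
                .unwrap();
            out
        })
        .collect();

    // One decompressor, reset between streams instead of being recreated.
    let mut inflate = Decompress::new(true);
    let mut out = Vec::with_capacity(1024);
    for stream in &streams {
        out.clear();
        inflate.reset(true); // reuse the existing state rather than building a new one
        inflate
            .decompress_vec(stream, &mut out, FlushDecompress::Finish)
            .unwrap();
        println!("inflated {} bytes", out.len());
    }
}
```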
Thank you for all the incredible work you're doing; I'm more than happy to keep benchmarking and keep cheering you on, as well as providing the occasional suggestion on approaches. :)
-
The download has completed and I will use these files from here on out. It seems the main difference to the ones produced by GitHub is a greater zlib compression factor.
I couldn't agree more! Learning from the previous iteration (the one that led to 0.3), I should really try not to be smart until there are some actual measurements. After all, that first version of the streaming code contained quite a bit of complication to be able to keep decompressed streams in memory, for zero benefit except for boosting the data structure size considerably (even if that feature wasn't used). Luckily, I feel not much time was lost, and fortunately it was never painful to implement it in the first place :D
I think every thread gets 50 base objects and will traverse the tree from there. The actual amount of work to do will differ greatly between the threads; there should be a lot of variance. I chose a small number to avoid starvation, and with 1.2 million base objects, there should be enough chunks to keep most threads busy all the time.
That's an interesting observation, and it seems to imply that finding a way to spawn many threads faster would be a possible micro-optimization here. If every thread took 10ms to start, we would need about a second to start all 96 of them.
I think that's worth an attempt - it's easy to inject a different cache implementation. It's worth noting that there is a tension here: a smarter cache adds latency to an algorithm that otherwise produces results instantly.
That's an interesting one, and I dreamt about streaming deltified objects a long time ago. There is even an attempt to do that written in C, which I believe works at least superficially. The problem would still be latency and memory consumption. The way I see it, one would have to merge all deltas into one, a relatively costly operation for longer chains that costs at least one allocation. Then there is the base object, which is needed for most chunks (presumably) as deltas try their best not to copy new data. Keeping around the 'whole object', which I assume here means the base object, isn't necessary, but most of the time it will have to be loaded to resolve portions of the delta. Seeing how fast the current implementation is, I am absolutely sure that 'streaming delta resolution' will be slower and cost more compute, and I feel this would really only be interesting for servers that want to stream a lot of deltified objects at reduced memory footprints, trading CPU time for it and, to some extent, IO if they choose to reload the base object on each 4kB request. The more I think about it, the more I believe implementing this as a net win for the user requires a lot of prior research to identify which tradeoffs should be made there.
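For context on why that is, here is what applying a single delta in the pack's delta encoding looks like: a sequence of copy-from-base and insert-literal commands, where copies may address arbitrary parts of the base. This is a standalone illustration with made-up data, not gitoxide's code, and error handling is omitted.

```rust
// Apply one delta (standard git delta encoding) to its base, producing the result.
fn apply_delta(base: &[u8], delta: &[u8]) -> Vec<u8> {
    fn varint(data: &[u8], pos: &mut usize) -> u64 {
        let (mut value, mut shift) = (0u64, 0);
        loop {
            let byte = data[*pos];
            *pos += 1;
            value |= ((byte & 0x7f) as u64) << shift;
            shift += 7;
            if byte & 0x80 == 0 {
                return value;
            }
        }
    }
    let mut pos = 0;
    let _base_size = varint(delta, &mut pos);  // declared size of the base
    let result_size = varint(delta, &mut pos); // declared size of the result
    let mut out = Vec::with_capacity(result_size as usize);
    while pos < delta.len() {
        let cmd = delta[pos];
        pos += 1;
        if cmd & 0x80 != 0 {
            // Copy command: bits 0-3 select offset bytes, bits 4-6 select size bytes.
            let (mut ofs, mut size) = (0usize, 0usize);
            for i in 0..4 {
                if cmd & (1 << i) != 0 { ofs |= (delta[pos] as usize) << (8 * i); pos += 1; }
            }
            for i in 0..3 {
                if cmd & (0x10 << i) != 0 { size |= (delta[pos] as usize) << (8 * i); pos += 1; }
            }
            if size == 0 { size = 0x10000; }
            out.extend_from_slice(&base[ofs..ofs + size]);
        } else {
            // Insert command: `cmd` literal bytes follow.
            out.extend_from_slice(&delta[pos..pos + cmd as usize]);
            pos += cmd as usize;
        }
    }
    debug_assert_eq!(out.len() as u64, result_size);
    out
}

fn main() {
    // Copy 5 bytes from offset 0 of the base, then insert " world".
    let base = b"hello!!!";
    let delta = [8, 11, 0x90, 5, 6, b' ', b'w', b'o', b'r', b'l', b'd'];
    assert_eq!(apply_delta(base, &delta), b"hello world".to_vec());
    println!("ok");
}
```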
Awesome, I am eagerly watching and promise that before doing anything related to creating packs, I will make sure zlib-ng can be toggled on. The current zlib module (in
Maybe the way to get a comfortable high-level interface is to switch to Flate2. Maybe I was hesitant for the wrong reasons - now that I have a custom implementation, it should be easy to throw it out and replace it with Flate2 to see if this negatively impacts performance after all. Maybe, maybe, not using it was a bit of a premature optimization, so definitely something I will look at as well. I think as long as there is streaming inflate and deflate, fast resets, and ways to deflate or inflate everything at once into pre-allocated buffers,
Indeed, and I think it might be worth looking into what pack format V3 is going to be like. If that's fixed there, there might not be huge payoffs in making pack format V2 faster.
Yes, let's do that!
-
When I saw the last few threads finishing up, the
No, it's not that only some of the threads had started, it's that none of the threads had started, and it wasn't 1s, it was several full seconds. As far as I can tell, gitoxide spent several seconds not processing objects at all; it was doing other processing that wasn't running a thread per CPU.
I don't think you'd have to keep the base object around; it should be possible to just keep the deflate streams active for an object currently being streamed. Tiny deltas (those smaller than a zlib state) could be kept in memory, while larger deltas or base objects could be kept as zlib streams plus the most recent chunk of data. Each time the caller asks for more data, you could process the deltas and walk the zlib streams forward. (That's assuming that deltas reference the base object more-or-less sequentially, otherwise this might well be a loss.) The goal would be to maintain a kind of "cursor" that can walk forward through an object that might be several MB but isn't being read all at once. (The whole pack would already be mmapped, and the corresponding index loaded.) I don't know if this would work well or not, but it seems worth an experiment.
-
I think what you probably saw is a thread ending up with 8000 objects to handle, based on the 50 base objects it got. During traversal, threads use unbounded progress reporting, as the actual number isn't known ahead of time except that it's at least 50. It might be that for some reason the chunk-size computation is off; that would be a bug. Tests seem to indicate that it shouldn't happen for kernel-pack-sized requests, though.
Aaah, good to know, this restores my faith in the performance of spawning threads :D. Indeed, it takes a moment until the index is built, and all of that should be visible. It should never appear to be doing nothing; that would be a bug as well.
I see, I hadn't even thought of keeping the zlib streams around under the given conditions; it makes perfect sense and goes the extra mile. I think once the zlib handling is refactored to be nicer and support zlib-ng, such an experiment could be run. My suggestion is to do a thorough analysis in an issue beforehand to think about the involved tradeoffs and the desired outcome, maybe along with a test that triggers the current, unwanted behaviour that one tries to improve.
-
The latest version of the
-
I can't wait to try it out! By the looks of it, after this iteration (clone) there will be an iteration to get basic push working.
-
The current verifying and indexing code would also benefit from faster decompression.
-
Absolutely, they will benefit 'for free' when the changes have landed. Maybe that would even be trivial for you to try out, given your knowledge in the field, and maybe you would even conclude that
-
Results on a MacBook Air M1, using INTEL binaries:
And for comparison, the same with the
-
And here is the result with ARM binaries on a MacBook Air with an M1 chip.
I had to run the
What's incredible is that the effective user time of git indicates it's using no more than 4 threads, bringing its user time (and possibly its efficiency) down to ~93s, whereas gitoxide uses all cores and more user time for only a small gain.
When running again and limiting the process to 4 threads, the user time is still significantly higher, making git the clear winner when it comes to efficiency no matter what.
-
I tried switching to
And for a baseline,
-
The latest results for verifying a pack on the M1Pro (10 cores). It's notable how much faster M1Pro has gotten at that task, with Sha1 reaching 1.8GB/s on a single core.
And here is git (limited to 4 threads as it doesn't scale beyond that, getting slower instead).
And an apples vs apples comparison with
One sees that it scales pretty well with cores, maybe even perfectly, as the parallel part of the operation indeed doesn't require any synchronization to speak of. And now index creation, meaning decompressing and hashing every object.
And the same in git.
-
When trying to beat `git verify-pack`
Attempt 1
I remembered timings on a cold cache that indicated something around 5:50min for git to run verify-pack on the Linux kernel pack. However, it turns out that if the environment is a little more controlled, git is still considerably faster than us, despite us using an LRU cache and using multiple cores quite efficiently.
Observation
Git uses a streaming pack approach which is optimized to apply objects inversely. It works by
We work using a memory mapped file which is optimized for random access, but won't be very fast for this kind of workload.
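For illustration, the random-access side looks roughly like this; a sketch using the memmap2 crate with a placeholder path, not the actual gitoxide code.

```rust
use memmap2::Mmap;
use std::fs::File;

fn main() -> std::io::Result<()> {
    let file = File::open("pack-1234.pack")?; // placeholder path
    // Safety: we assume nobody truncates the pack while it is mapped.
    let pack = unsafe { Mmap::map(&file)? };

    // Random access: any entry offset taken from the .idx can be addressed directly,
    // after the 12-byte pack header ("PACK", version, number of objects).
    let first_entry_offset = 12;
    println!("first byte of the first entry: {:#04x}", pack[first_entry_offset]);
    Ok(())
}
```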
How to fix
Wait until we have implemented a streaming pack as well and try again, having the same algorithmic benefits, possibly paired with more efficient memory handling.
Git for some reason limits itself to 3 threads; we do benefit from having more threads, so we could be faster just because of this.
The streaming (indexing) phase of reading a pack can be parallelised when we have a pack on disk, and it should be easy to implement if the index data structure itself is thread-safe (but it might not be worth the complexity or memory overhead; let's see).