[performance] beating git in git verify-pack
#47
-
Measurements with the lookup-based algorithm, constrained to three threads for fairness. Lookup is clearly slower, but thanks to the LRU cache not nearly as bad as it could be, given the enormous amount of additional work it does.
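As a side note for readers, here is a minimal sketch of what such an LRU cache of decompressed delta bases (keyed by pack offset) could look like; the type and its API are made up for illustration and are not gitoxide's actual cache.

```rust
use std::collections::{HashMap, VecDeque};

// Toy LRU cache for decompressed delta bases, keyed by pack offset.
struct BaseCache {
    capacity: usize,
    map: HashMap<u64, Vec<u8>>,
    order: VecDeque<u64>, // front = least recently used
}

impl BaseCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    fn get(&mut self, offset: u64) -> Option<&Vec<u8>> {
        if self.map.contains_key(&offset) {
            // Mark as most recently used (O(n), fine for a toy).
            self.order.retain(|&o| o != offset);
            self.order.push_back(offset);
        }
        self.map.get(&offset)
    }

    fn put(&mut self, offset: u64, data: Vec<u8>) {
        if self.map.insert(offset, data).is_none() {
            self.order.push_back(offset);
            if self.order.len() > self.capacity {
                if let Some(evicted) = self.order.pop_front() {
                    self.map.remove(&evicted);
                }
            }
        }
    }
}

fn main() {
    let mut cache = BaseCache::new(2);
    cache.put(100, vec![0; 1024]); // pretend: decompressed base at pack offset 100
    cache.put(200, vec![0; 2048]);
    cache.get(100);                // touch 100, so 200 becomes the eviction candidate
    cache.put(300, vec![0; 512]);  // evicts 200
    assert!(cache.get(200).is_none());
    assert!(cache.get(100).is_some() && cache.get(300).is_some());
}
```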
-
Note that the statistics in the image above show a delta chain length of at most 1, as we have a 'perfect' cache thanks to the tree-like delta traversal. We are now faster than git for the very first time, a win that didn't come easily and certainly isn't perfect: git uses fewer system calls and doesn't seem to rely visibly on memory-mapping packs for random access, which appears to be why it uses far less memory. However, our implementation scales perfectly and isn't artificially limited to three threads (unless this is somewhat covertly overridden). Since implementing new variants of how to do this kind of work is fairly straightforward, this might not even be the end of it.
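A rough sketch of the tree-like traversal idea, with made-up types and closures standing in for the real decompression and delta application (gitoxide's actual implementation differs): deltas are grouped by the offset of their base, and each base is resolved before its deltas, so the data a delta needs is always already in memory.

```rust
use std::collections::HashMap;

// An entry as seen in the pack: either a base object, or a delta referring to one.
struct Entry {
    offset: u64,
    base_offset: Option<u64>, // Some(..) for deltas, None for base objects
}

fn traverse(
    entries: &[Entry],
    decompress_base: impl Fn(u64) -> Vec<u8>,
    apply_delta: impl Fn(&[u8], u64) -> Vec<u8>,
    mut visit: impl FnMut(u64, &[u8]),
) {
    // Group deltas by the base they need.
    let mut children: HashMap<u64, Vec<u64>> = HashMap::new();
    let mut roots = Vec::new();
    for e in entries {
        match e.base_offset {
            Some(base) => children.entry(base).or_default().push(e.offset),
            None => roots.push(e.offset),
        }
    }
    // Walk from each base towards its deltas: whenever a delta is applied,
    // its base is already resolved and in memory, i.e. the 'perfect' cache.
    let mut stack: Vec<(u64, Vec<u8>)> = roots
        .into_iter()
        .map(|ofs| (ofs, decompress_base(ofs)))
        .collect();
    while let Some((ofs, data)) = stack.pop() {
        visit(ofs, &data);
        for &child in children.get(&ofs).into_iter().flatten() {
            stack.push((child, apply_delta(&data, child)));
        }
    }
}

fn main() {
    // Two bases (offsets 1, 2) and a small delta chain on top of base 1.
    let entries = [
        Entry { offset: 1, base_offset: None },
        Entry { offset: 2, base_offset: None },
        Entry { offset: 3, base_offset: Some(1) },
        Entry { offset: 4, base_offset: Some(3) },
    ];
    traverse(
        &entries,
        |ofs| format!("base@{ofs}").into_bytes(),
        |base, ofs| [base, format!("+delta@{ofs}").as_bytes()].concat(),
        |ofs, data| println!("{ofs}: {}", String::from_utf8_lossy(data)),
    );
}
```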
-
@joshtriplett If you feel like setting a new speed record, you could try running the latest version against the kernel pack (and publish your results here :) ). This time, and with three threads, the Rust version seems to be equal or a little faster, gaining more and more of an advantage the more cores you add. Git indeed does not scale well and gets slower beyond three cores. (My current plan is to eventually post a link to this issue on reddit.)
-
A few more tests were run; my expectation would nonetheless be that it will scale well with more cores.
-
Interestingly, with more threads than are logically available on the machine, there is still some gain to be had: 6 threads on a 4-core machine seem to yield the best results, allowing my machine to crack the 1GB mark!
-
According to this… …git-oxide is now 27.6/19.5 = 1.41x faster than git in this particular task with three threads, and 27.6/16.0 = 1.72x faster with 4 threads. And it could scale with more cores. I wonder what the difference in algorithm is that causes git to use less memory (or not rely on memory maps) but fail to scale.
-
Just tested. I made sure the packfile and the index were in cache first; cached reads of them off disk take about 0.23s and 0.04s, respectively. Is git's
Note on git-oxide: its help says it can take either a .pack file or a .idx file, but when run on a .pack it doesn't seem to do anything visible except read the pack. To get anything interesting I had to run it on the index. I also found that with 96 threads, the less-memory algorithm was much faster; it uses more user time, but finishes in half the real time, and pulls off more than 10GB/s. Results with git-oxide:
You've definitely got a scaling problem in the default less-time algorithm, because with 48 threads instead of 96, it finishes faster:
By contrast, the less-memory algorithm ran just barely slower with 48 threads; that suggests a scaling problem as well, but not quite as bad of one:
Results at 24 threads:
12 threads:
And 6 threads:
Side note: is there any reason you can't compute the sha1s of the pack and index in parallel, and in parallel with the subsequent checking? Also, the
A bunch of threads with 0-2 entries looks like a problematic bottleneck. Hope that helps.
-
Thanks so much for taking the time! Very informative, and very valuable information…! Right now I wouldn't know if this can get any faster, or if at some point the memory subsystem of the machine is simply overloaded when too many threads compete. It accesses the memory-mapped pack, decompresses the data, and computes SHA1 and CRC32 on all entries, and without debugging on such a machine I doubt any further optimizations can be done reliably. My answers:
At least that's what I think, more or less. The idea is to build a tree of deltas to allow traversing the chain from base to delta, and not from delta to base. This allows for a cache implementation that is perfect, i.e. a delta can be applied to its base from cache all the time.
Without having studied git's code in detail, it looks like it decompresses objects while indexing them, storing some of them in memory. To me it seems it doesn't need an index for this to work.
Yes, related to the above. A valid pack always has an index and I think it's beneficial to make use of it. The reason git doesn't do it to that extent might be that
Indeed, the parallelisation is done using a super simple fan-in-fan-out map-reduce using nothing more than std threads and channels. In a way, this already is work stealing, but it can only steal entire chunks of work. Right now the chunk size is rather big, and the smaller it is, the more one pays for the channel synchronisation. Probably some tuning is possible there. Maybe it's worth trying to have chunks as small as 10 objects; for all I know, having one object per chunk (i.e. no chunking) visibly slows it down due to sync overhead.
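For readers unfamiliar with the pattern, here is a minimal sketch of such a fan-out/fan-in setup using nothing but std threads and channels; the per-chunk work function, the chunk size, and the numbers are placeholders, not gitoxide's actual code.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Placeholder for the real per-chunk work (decompress, apply deltas, SHA1, CRC32, ...).
fn process(chunk: Vec<u64>) -> usize {
    chunk.len()
}

fn main() {
    let num_workers = 4;
    let chunk_size = 1000; // the knob discussed above: bigger chunks mean less sync overhead
    let work: Vec<u64> = (0..1_000_000).collect();

    let (work_tx, work_rx) = mpsc::channel::<Vec<u64>>();
    let work_rx = Arc::new(Mutex::new(work_rx)); // share one receiver among the workers
    let (result_tx, result_rx) = mpsc::channel::<usize>();

    let mut workers = Vec::new();
    for _ in 0..num_workers {
        let work_rx = Arc::clone(&work_rx);
        let result_tx = result_tx.clone();
        workers.push(thread::spawn(move || loop {
            let chunk = {
                // Lock only long enough to receive the next chunk; every recv()
                // is a synchronization point, which is why tiny chunks are costly.
                let rx = work_rx.lock().unwrap();
                match rx.recv() {
                    Ok(chunk) => chunk,
                    Err(_) => break, // channel closed: no more work
                }
            };
            result_tx.send(process(chunk)).ok();
        }));
    }
    drop(result_tx); // keep only the workers' clones alive

    // Fan out: a single producer chunks the work.
    for chunk in work.chunks(chunk_size) {
        work_tx.send(chunk.to_vec()).expect("workers alive");
    }
    drop(work_tx); // closing the channel lets the workers terminate

    // Fan in: reduce results as they arrive.
    let total: usize = result_rx.iter().sum();
    for w in workers {
        w.join().unwrap();
    }
    println!("processed {total} objects");
}
```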
-
Regarding the bottleneck: maybe using a different channel implementation can speed this up; right now plain std channels are in use.
-
I have played around with the relevant variables a little and came up with a better (read: higher) chunk size for the 'less-time' algorithm, while opening the code up a little for smarter heuristics as time passes. Ideally, it finds chunk sizes that keep threads busy for no more than around a second; that way starvation won't really be noticeable, while keeping the synchronization cost low.
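In the spirit of that heuristic, a toy version of such a computation could look like the following; the numbers and the function are purely illustrative, not the code in gitoxide.

```rust
/// Pick a chunk size that keeps a thread busy for roughly a second at most,
/// while still producing enough chunks that no thread starves near the end.
/// `items_per_sec_per_thread` is an assumed or measured throughput, not a constant.
fn chunk_size(num_items: usize, num_threads: usize, items_per_sec_per_thread: usize) -> usize {
    // Enough chunks for reasonable load balancing ...
    let for_balance = num_items / (num_threads.max(1) * 10);
    // ... but no chunk should take much longer than ~1s, and chunks shouldn't be
    // so tiny that channel synchronization dominates.
    for_balance.min(items_per_sec_per_thread).max(64)
}

fn main() {
    // e.g. 7.5 million objects, 8 threads, ~150k objects/s per thread (made-up numbers)
    println!("chunk size: {}", chunk_size(7_500_000, 8, 150_000));
}
```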
-
Closing this issue.
-
It could be done, certainly if optimizing for speed is the goal. However, I thought it isn't necessary to start checking objects if the SHA1 of the pack already turns out to be wrong. Maybe there could be a flag to simply skip the SHA1 checks and proceed with the per-object checks. Something like that could certainly be added if it makes things a little faster overall.
-
On Wed, Jul 22, 2020 at 10:00:09PM -0700, Sebastian Thiel wrote:
> @joshtriplett
> > Side note: is there any reason you can't compute the sha1s of the pack and index in parallel, and in parallel with the subsequent checking?
> It could be done, certainly if optimizing for speed is the goal. However, I thought it's not necessary to start checking if the SHA1 of the pack already turns out to be wrong.
> Maybe there could be a flag to simply skip the Sha1 checks and proceed with the per-object checks.
> Something like that could certainly be added though if it should get a little faster overall.
I'm less concerned about the performance of verify-pack in particular,
and more thinking that in general git-oxide should try to do as much as
possible in parallel.
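To make the suggestion concrete, here is a sketch of checksumming the .pack and the .idx on separate threads while other verification work could proceed. The paths are placeholders and the RustCrypto `sha1` crate is assumed for the digest; this is not gitoxide's code.

```rust
use sha1::{Digest, Sha1};
use std::{fs, thread};

// Hash a whole file; a real implementation would stream it in chunks instead.
fn sha1_of(path: &str) -> std::io::Result<Vec<u8>> {
    let bytes = fs::read(path)?;
    Ok(Sha1::digest(&bytes).to_vec())
}

fn main() -> std::io::Result<()> {
    let (pack_sha, idx_sha) = thread::scope(|s| {
        let pack = s.spawn(|| sha1_of("pack-1234.pack")); // placeholder path
        let idx = s.spawn(|| sha1_of("pack-1234.idx"));   // placeholder path
        // ... per-object verification could start here, overlapping with both hashes ...
        (pack.join().unwrap(), idx.join().unwrap())
    });
    println!("pack: {:02x?}", pack_sha?);
    println!("idx:  {:02x?}", idx_sha?);
    Ok(())
}
```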
-
With further optimizations, I now get down to 90s for verifying the kernel pack, compared to three minutes for git. The same data structure is used for receiving/indexing packs too, which allows cloning faster than git. The above time was achieved with four threads, as opposed to the three that git uses by default. For comparison, here is the same task with git.
-
@Byron Nice! I would love to see a faster
-
This is incredibly impressive scalability now.
You may want to rate-limit the progress updates.
Watching With the
Pack indexing:
You're right that it spends most of its time limited by decompression, then spends a few seconds using many CPUs, then finishes. Regarding decompression performance, try replacing the current zlib implementation (see the quick benchmarks in my next comment). Also, as far as I can tell, don't pack files have many separately deflate-compressed streams, rather than one giant deflate-compressed stream? Couldn't you deflate multiple pieces in parallel?
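On the last question: each pack entry is indeed its own zlib stream, so once the entry offsets are known (e.g. from the .idx), independent entries can be inflated on separate threads. Below is a self-contained sketch using flate2 (which can be backed by zlib-ng through a Cargo feature); the test data is generated on the spot and stands in for pack entries.

```rust
use flate2::bufread::{ZlibDecoder, ZlibEncoder};
use flate2::Compression;
use std::io::Read;
use std::thread;

// Produce a few small, independently compressed zlib streams as stand-ins for pack entries.
fn make_test_streams() -> Vec<Vec<u8>> {
    (0..4)
        .map(|i| {
            let data = vec![i as u8; 8192];
            let mut out = Vec::new();
            ZlibEncoder::new(data.as_slice(), Compression::default())
                .read_to_end(&mut out)
                .unwrap();
            out
        })
        .collect()
}

fn main() {
    let streams = make_test_streams();

    // Inflate every stream on its own thread; real code would hand out chunks instead.
    let outputs: Vec<Vec<u8>> = thread::scope(|s| {
        let handles: Vec<_> = streams
            .iter()
            .map(|compressed| {
                s.spawn(move || {
                    let mut out = Vec::new();
                    ZlibDecoder::new(compressed.as_slice())
                        .read_to_end(&mut out)
                        .expect("valid zlib stream");
                    out
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    println!(
        "inflated {} streams, {} bytes total",
        outputs.len(),
        outputs.iter().map(|o| o.len()).sum::<usize>()
    );
}
```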
-
**Quick informal benchmarks of DEFLATE implementations**

For test data, I tarred up the contents of a local directory (the /tmp/usr-bin.tar used below).

```rust
use std::io::Read;
use std::time::Instant;
use flate2::Compression;
use flate2::bufread::{GzDecoder, GzEncoder};

fn main() -> anyhow::Result<()> {
    let mut dest = Vec::with_capacity(1024*1024*1024);

    let uncompressed_data = std::fs::read("/tmp/usr-bin.tar")?;
    let time = Instant::now();
    let mut gz = GzEncoder::new(uncompressed_data.as_slice(), Compression::default());
    gz.read_to_end(&mut dest)?;
    let elapsed = time.elapsed();
    eprintln!("Compressed {} bytes to {} bytes in {:?}", uncompressed_data.len(), dest.len(), elapsed);

    dest.clear();

    let compressed_data = std::fs::read("/tmp/usr-bin.tar.gz")?;
    let time = Instant::now();
    let mut gz = GzDecoder::new(compressed_data.as_slice());
    gz.read_to_end(&mut dest)?;
    let elapsed = time.elapsed();
    eprintln!("Decompressed {} bytes to {} bytes in {:?}", compressed_data.len(), dest.len(), elapsed);
    Ok(())
}
```

For each test, I compiled with a different backend:

miniz (default pure-Rust implementation)
Compressed 462356480 bytes to 161332179 bytes in 25.073941226s

miniz-sys (
-
Whoop whoop, this feels like being knighted, and all the work totally paid off. Iteration after iteration after iteration, and it really was just Rust pushing me into doing "what's right", which also happens to be fast as a freebie.
I will try; is there a platform you would recommend for that? Thus far I was unable to obtain the tiny kernel pack of only 1.4GB, as the clones would always abort midway given my incredible 6kb/s download speed. Alternatively, if you have an easy way to share that file, I would be happy to give downloading that a try.
To me it looks like it's still a tiny fraction, but I might misread what is shown here. My disposition for this one is to leave it, as I am happy with the small overhead, and that turning it off works so well. It's all monomorphic and the branch predictor certainly has an easy time seeing that this thing is always off or always on :D.
This all comes down to the chunk size, which right now is static and computed using a fantastic algorithm that probably should be done by someone who knows math. As the total number of chunks can be recomputed, I could imagine knowing that for the last N chunks (N = num cores) we have to split them smaller to feed all threads. If that is implemented, I could imagine this ~6.2 number going down to ~5.2 even. Definitely something to look into. Maybe as a first simple option, there could be a 'chunk-size' flag to override the batch size and tune it based on experience. Maybe that way it's already possible to squeeze out another half a second or so. It's odd though that in the first few seconds there isn't enough work available for all threads to start right away, which might indicate that the single-threaded producer is not fast enough.
It's great this one also got faster, even though the core algorithm is definitely unchanged. Probably it gets started more quickly now because the pack verification is done in parallel with everything else - thanks again for giving me the nudge in that direction!
Yes, but only if you know where these streams are. Right now the only way to find out (when only looking at the pack file) is to set the decompressor past the header of a pack entry and let it decompress till done; this marks the beginning of the next entry. If it were somehow possible to find entries more quickly, an index could be built much faster. Index traversal is already limited only by sequential/burst IO speeds, as it skips through the pack to read entry headers only. It can do that because it knows where each entry begins (and it's faster than seeking :D). Since you seem to be involved with zlib much more than I am, let me restate that finding the beginning or end of a zlib stream reliably without decompressing it would be a major thing for clone speeds of git. And while writing this, the quick zlib comparison came in: thanks so much! Switching to zlib-ng will be very impactful! Getting a better zlib decompressor/compressor is definitely high up on the list and I would be happy to allocate some time for it once you give the go. Ideally it's possible to avoid pulling in Flate2, but I'd do it if you think it's the right thing to do after all (as long as streams can be re-used thanks to a fast reset). Thanks again, it's great to enjoy your support, and you are having quite an impact on this young project, shaping it for the better!
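To make the "entry headers only" part concrete, here is a small parser for the v2 pack entry header, which encodes the object type and the decompressed size as a varint; the compressed size is nowhere to be found, which is exactly why locating the next entry without an index means inflating to the end of the current one. (Delta entries additionally carry their base offset or base id right after this header; that part is omitted, and the test bytes are hand-crafted.)

```rust
/// Parse the header of a pack (v2) entry at `offset`: returns (type, decompressed size,
/// offset of the first byte after the header). Error handling is omitted for brevity.
fn parse_entry_header(pack: &[u8], offset: usize) -> (u8, u64, usize) {
    let mut i = offset;
    let mut byte = pack[i];
    i += 1;
    let kind = (byte >> 4) & 0b111;          // object type (commit, tree, blob, delta, ...)
    let mut size = (byte & 0b1111) as u64;   // low 4 bits of the decompressed size
    let mut shift = 4;
    while byte & 0b1000_0000 != 0 {          // MSB set: another size byte follows
        byte = pack[i];
        i += 1;
        size |= ((byte & 0b0111_1111) as u64) << shift;
        shift += 7;
    }
    (kind, size, i)
}

fn main() {
    // Hand-crafted example: continuation bit set, type 1 (commit), size = 5 + (2 << 4) = 37.
    let bytes = [0b1001_0101u8, 0b0000_0010];
    assert_eq!(parse_entry_header(&bytes, 0), (1, 37, 2));
    println!("ok");
}
```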
-
You're doing incredible work, and I'm glad it's working out so well.
Sure: <redacted by @Byron> I'd suggest downloading with curl and making sure you can resume in case it gets aborted. Please let me know when you have them so I can remove them. I'll hang onto those same objects for future testing consistency. SHA256 checksums:
It adds 0.6s to a 10.6s build, which is a 5.7% increase. Not huge, but it's a non-trivial amount of measurement overhead. That said, I completely agree that you don't need to prioritize it. Long-term, there'd be value in improving it just a little bit, since people may want a simple progress indicator for long operations, without needing the full detail you currently capture; however, so many other things are higher priority.
Interesting! And yeah, I'm impressed with how good the progress data is. If you ever want to work on it, we should do some actual profiling on the 96-way to see where the overhead is. (It'd be easy to capture
I did find that algorithm when I was looking for the chunk size, though I didn't dig into the details. As a quick hack, I tried dropping the chunk size. What number would you expect to see per thread?
When I looked at the full
Makes perfect sense. Would it be possible, with some care, to use the index to figure out in advance which objects will be needed again and which ones won't? Could you compute a small DAG of objects you need for deltas (without storing the objects themselves), and use that to decide the order you process objects in? There's a related problem I'd be extremely interested in a solution to: could you reconstruct objects incrementally, and not reconstruct the entire object at once? For instance, if I needed the first 4k of a blob, could you reconstruct just that 4k, without reconstructing the entire object? (And then reconstruct the next 4k, and the next 4k, on request, as efficiently as possible but without keeping around the whole object?)
Ah, I see. So, anything other than indexing can use the index file to know where all the objects are and decompress them in parallel, but indexing itself has to decompress sequentially to find those offsets? Makes sense. It's unfortunate that the pack file format stores the decompressed size of each object, but not the compressed size.
Oooh, that's a fun idea. I don't think that it's possible to find the end of the stream in constant time, and I don't think pattern-matching bytes would be possible. But I think that with some care, it would be possible to rapidly skip through a DEFLATE-compressed stream without decompressing it, if you don't actually care about the data itself and you only want to find the end of it. You could skip many of the actual DEFLATE steps, and just decode the symbols from the Huffman tables and skip over match/distance repeats without constructing the decompressed data. See https://en.wikipedia.org/wiki/DEFLATE and https://tools.ietf.org/html/rfc1951 for details on the DEFLATE format. I filed zlib-ng/zlib-ng#714 requesting an implementation of this in zlib-ng.
As far as I can tell, zlib's reset should be efficient and not involve reallocating or zeroing any state. I do think you should go ahead and switch to something that's based on zlib-ng. EDIT: and using flate2 would give you the ability to pick between performance and pure Rust without having to implement that support yourself.
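For reference, a sketch of what reusing one zlib state across many small streams looks like with flate2's low-level types; the compressed test data is generated on the spot and merely stands in for pack entries.

```rust
use flate2::{Compress, Compression, Decompress, FlushCompress, FlushDecompress};

fn main() {
    // Generate a few small, independent zlib streams.
    let streams: Vec<Vec<u8>> = (0..3)
        .map(|i| {
            let mut deflate = Compress::new(Compression::default(), true);
            let mut out = Vec::with_capacity(256);
            deflate
                .compress_vec(&vec![i as u8; 100], &mut out, FlushCompress::Finish)
                .unwrap();
            out
        })
        .collect();

    // One decompressor, reset between streams instead of being recreated.
    let mut inflate = Decompress::new(true);
    let mut out = Vec::with_capacity(1024);
    for stream in &streams {
        out.clear();
        inflate.reset(true); // reuse the existing state rather than building a new one
        inflate
            .decompress_vec(stream, &mut out, FlushDecompress::Finish)
            .unwrap();
        println!("inflated {} bytes", out.len());
    }
}
```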
Thank you for all the incredible work you're doing; I'm more than happy to keep benchmarking and keep cheering you on, as well as providing the occasional suggestion on approaches. :)
-
The download has completed and I will use these files from here on out. It seems the main difference to the ones produced by GitHub is a greater zlib compression factor.
I couldn't agree more! Learning from the previous iteration (the one that led to 0.3), I should really try not to be smart until there are some actual measurements. After all, that first version of the streaming code contained quite a bit of complication to be able to keep decompressed streams in memory, for zero benefit except for boosting the data structure size considerably (even if that feature wasn't used). Luckily, I feel not much time was lost, and fortunately it was never painful to implement it in the first place :D
I think every thread gets 50 base objects and will traverse the tree from there. The actual amount of work to do will differ greatly between the threads; there should be a lot of variance. I chose a small number to avoid starvation, and with 1.2 million base objects, there should be enough chunks to keep most threads busy all the time.
That's an interesting observation, and it seems to imply that finding a way to spawn many threads faster would be a possible micro-optimization here. If every thread took 10ms to start, we would need about a second to start all 96 of them.
I think that's worth an attempt - it's easy to inject a different cache implementation. It's worth noting that there is a tension here: a smarter cache adds latency to an algorithm that otherwise produces results instantly.
That's an interesting one, and I dreamt about streaming deltified objects a long time ago. There is even an attempt to do that written in C, which I believe works at least superficially. The problem would still be latency and memory consumption. The way I see it, one would have to merge all deltas into one, a relatively costly operation for longer chains that costs at least one allocation. Then there is the base object, which is needed for most chunks (presumably) as deltas try their best not to copy new data. Keeping around the 'whole object', which I assume here means the base object, isn't necessary, but most of the time it will have to be loaded to resolve portions of the delta. Seeing how fast the current implementation is, I am absolutely sure that 'streaming delta resolution' will be slower and cost more compute, and I feel this would really only be interesting for servers that want to stream a lot of deltified objects at reduced memory footprints, trading CPU time for it and, to some extent, IO if they choose to reload the base object on each 4kB request. The more I think about it, the more I believe implementing this as a net win for the user requires a lot of prior research to identify which tradeoffs should be made there.
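For context on why that is, here is what applying a single delta in the pack's delta encoding looks like: a sequence of copy-from-base and insert-literal commands, where copies may address arbitrary parts of the base. This is a standalone illustration with made-up data, not gitoxide's code, and error handling is omitted.

```rust
// Apply one delta (standard git delta encoding) to its base, producing the result.
fn apply_delta(base: &[u8], delta: &[u8]) -> Vec<u8> {
    fn varint(data: &[u8], pos: &mut usize) -> u64 {
        let (mut value, mut shift) = (0u64, 0);
        loop {
            let byte = data[*pos];
            *pos += 1;
            value |= ((byte & 0x7f) as u64) << shift;
            shift += 7;
            if byte & 0x80 == 0 {
                return value;
            }
        }
    }
    let mut pos = 0;
    let _base_size = varint(delta, &mut pos);  // declared size of the base
    let result_size = varint(delta, &mut pos); // declared size of the result
    let mut out = Vec::with_capacity(result_size as usize);
    while pos < delta.len() {
        let cmd = delta[pos];
        pos += 1;
        if cmd & 0x80 != 0 {
            // Copy command: bits 0-3 select offset bytes, bits 4-6 select size bytes.
            let (mut ofs, mut size) = (0usize, 0usize);
            for i in 0..4 {
                if cmd & (1 << i) != 0 { ofs |= (delta[pos] as usize) << (8 * i); pos += 1; }
            }
            for i in 0..3 {
                if cmd & (0x10 << i) != 0 { size |= (delta[pos] as usize) << (8 * i); pos += 1; }
            }
            if size == 0 { size = 0x10000; }
            out.extend_from_slice(&base[ofs..ofs + size]);
        } else {
            // Insert command: `cmd` literal bytes follow.
            out.extend_from_slice(&delta[pos..pos + cmd as usize]);
            pos += cmd as usize;
        }
    }
    debug_assert_eq!(out.len() as u64, result_size);
    out
}

fn main() {
    // Copy 5 bytes from offset 0 of the base, then insert " world".
    let base = b"hello!!!";
    let delta = [8, 11, 0x90, 5, 6, b' ', b'w', b'o', b'r', b'l', b'd'];
    assert_eq!(apply_delta(base, &delta), b"hello world".to_vec());
    println!("ok");
}
```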
Awesome, I am eagerly watching and promise that before doing anything related to creating packs, I will make sure zlib-ng can be toggled on. The current zlib module (in
Maybe the way to get a comfortable high-level interface is to switch to Flate2. Maybe I was hesitant for the wrong reasons - now that I have a custom implementation, it should be easy to throw it out and replace it with Flate2 to see if this negatively impacts performance after all. Maybe, maybe, not using it was a bit of a premature optimization, so definitely something I will look at as well. I think as long as there is streaming inflate and deflate, fast resets, and ways to deflate or inflate everything at once into pre-allocated buffers,
Indeed, and I think it might be worth looking into what pack format V3 is going to be like. If that's fixed there, there might not be huge payoffs in making pack format V2 faster.
Yes, let's do that!
-
When I saw the last few threads finishing up, the
No, it's not that only some of the threads had started, it's that none of the threads had started, and it wasn't 1s, it was several full seconds. As far as I can tell, gitoxide spent several seconds not processing objects at all; it was doing other processing that wasn't running a thread per CPU.
I don't think you'd have to keep the base object around; it should be possible to just keep the deflate streams active for an object currently being streamed. Tiny deltas (those smaller than a zlib state) could be kept in memory, while larger deltas or base objects could be kept as zlib streams plus the most recent chunk of data. Each time the caller asks for more data, you could process the deltas and walk the zlib streams forward. (That's assuming that deltas reference the base object more-or-less sequentially, otherwise this might well be a loss.) The goal would be to maintain a kind of "cursor" that can walk forward through an object that might be several MB but isn't being read all at once. (The whole pack would already be mmapped, and the corresponding index loaded.) I don't know if this would work well or not, but it seems worth an experiment.
-
I think what you probably saw is a thread ending up with 8000 objects to handle, based on the 50 base objects it got. During traversal, threads use unbounded progress reporting, as the actual number isn't known ahead of time except that it's at least 50. It might be that for some reason the chunk-size computation is off; that would be a bug. Tests seem to indicate that it shouldn't happen for kernel-pack-sized requests, though.
Aaah, good to know, this restores my faith in the performance of spawning threads :D. Indeed, it takes a moment until the index is built, and all of that should be visible. It should never appear to be doing nothing; that would be a bug as well.
I see, I hadn't even thought of keeping the zlib streams around under the given conditions; it makes perfect sense and goes the extra mile. I think once the zlib handling is refactored to be nicer and support zlib-ng, such an experiment could be run. My suggestion is to do a thorough analysis in an issue beforehand to think about the involved tradeoffs and the desired outcome, maybe along with a test that triggers the current, unwanted behaviour that one tries to improve.
-
The latest version of the
-
I can't wait to try it out! By the looks of it, after this iteration (clone) there will be an iteration to get basic push working.
-
The current verifying and indexing code would also benefit from faster decompression.
-
Absolutely, they will benefit 'for free' when the changes have landed. Maybe that would even be trivial for you to try out, given your knowledge in the field, and maybe you would even conclude that
-
Results on a MacBook Air M1, using INTEL binaries:
And for comparison, the same with the
-
And here is the result with ARM binaries on a MacBook Air with an M1 chip.
I had to run the
What's incredible is that the effective user time of git indicates it's using no more than 4 threads, bringing its user time (and possibly its efficiency) down to ~93s, whereas gitoxide uses all cores and more user time for only a small gain.
When running again and limiting the process to 4 threads, the user time is still significantly higher, making git the clear winner when it comes to efficiency no matter what.
-
I tried switching to
And for a baseline,
-
The latest results for verifying a pack on the M1Pro (10 cores). It's notable how much faster M1Pro has gotten at that task, with Sha1 reaching 1.8GB/s on a single core.
And here is git (limited to 4 threads as it doesn't scale beyond that, getting slower instead).
And an apples vs apples comparison with
One sees that it scales pretty well with cores, maybe even perfectly, as the parallel part of the operation indeed doesn't require any synchronization to speak of. And now index creation, meaning decompressing and hashing every object.
And the same in git.
-
When trying to beat `git verify-pack`
Attempt 1
I remembered timings on a cold cache that indicated something around 5:50min for git to run verify-pack on the Linux kernel pack. However, it turns out that if the environment is a little more controlled, git is still considerably faster than us, despite us using an LRU cache and using multiple cores quite efficiently.
Observation
Git uses a streaming pack approach which is optimized to apply objects inversely. It works by
We work using a memory mapped file which is optimized for random access, but won't be very fast for this kind of workload.
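For illustration, the random-access side looks roughly like this; a sketch using the memmap2 crate with a placeholder path, not the actual gitoxide code.

```rust
use memmap2::Mmap;
use std::fs::File;

fn main() -> std::io::Result<()> {
    let file = File::open("pack-1234.pack")?; // placeholder path
    // Safety: we assume nobody truncates the pack while it is mapped.
    let pack = unsafe { Mmap::map(&file)? };

    // Random access: any entry offset taken from the .idx can be addressed directly,
    // after the 12-byte pack header ("PACK", version, number of objects).
    let first_entry_offset = 12;
    println!("first byte of the first entry: {:#04x}", pack[first_entry_offset]);
    Ok(())
}
```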
How to fix
Wait until we have implemented a streaming pack as well and try again, having the same algorithmic benefits, possibly paired with more efficient memory handling.
Git for some reason limits itself to 3 threads; we do benefit from having more threads, so we could be faster just because of this.
The streaming (indexing) phase of reading a pack can be parallelised when we have a pack on disk, and it should be easy to implement if the index data structure itself is thread-safe (but it might not be worth the complexity or memory overhead; let's see).