Investigate why Solana is so slow. #6

Eh2406 · 2024-08-14T20:34:23Z

Currently the Solana crates are filtered out of our test suite here. For some reason they take an extraordinarily long time to resolve (both for the existing resolver and even more so for PubGrub). It is not clear why they are difficult. Presumably they have some share dependency with some pattern that the resolvers do not like. What dependency, what pattern, we do not know. Perhaps if we had a smaller example that still exemplified the problematic behavior it would be easier to see the problem.

Eh2406 · 2024-08-19T15:11:19Z

Here is the generated index file for one of the particularly problematic crates that was filtered out.
[email protected].

Eh2406 · 2024-08-21T15:36:02Z

Focusing in on crates that had a high ratio of runtime to should cancel call count, pulled out the solana-transaction-status for investigation. After applying some minimization tools, here is a file that takes ~25s and is only 350k lines long. Which may or may not be useful to look at.
[email protected]

Edit: [email protected] was fixed with 7d6a3f0

x-hgg-x · 2024-10-07T21:15:24Z

It seems that there is a lot a backtracking in the cargo resolver for specific versions of solana-core.

After checking the dependencies, I can see why:

Each of the 78 versions of solana-core between 1.0.0 and 1.5.14 has a direct dependency to the other 20+ solana-X crates of the form ^1.x.y, where 1.x.y is the version of solana-core (this also applies for all solana-X crates before 1.5.14).
Each of the 252 versions of solana-core between 1.5.15 and 1.18.25 has changed the format of the solana-X dependencies, which is now =1.x.y (this also applies for all solana-X crates after 1.5.15).

Since the resolver tries to get the most recent compatible version if possible, if we set solana-core = "=1.0.0" in a Cargo.toml the resolver will try all combinations of the 300+ versions for the 20+ solana crates without success, before finding a solution where all solana crates are at version 1.0.0. In my machine the resolution took almost 10min with cargo generate-lockfile, and 1m30s if I use -Zminimal-versions.

For versions 1.5.15 and above, the resolution is very fast since there is only one possible solution for the solana crates due to the = in the version requirement.

Eh2406 · 2024-10-07T23:25:12Z

I guess I'm missing a detail, for versions <=1.5.14 what about the structure requires all the version numbers match?

The general question is can we find a smaller example that has the same reason for being slow? The easiest to debug are samples where every crate has a regular structure, and the overall runtime is exponential (or at least cubic) in the number of crates/versions. But those are hard to create from real-world examples. I've also been able to find the problem by identifying a small change that makes a big difference. Like "removing this one feature of one dependency of one crate makes things twice as fast" help me track down the [email protected] problem.

Looking at crates with a high -Zminimal-versions time to normal resolution time is a good idea I have not yet tried. Thanks.

x-hgg-x · 2024-10-08T16:14:15Z

By looking at the logs using CARGO_LOG='off,cargo::core::resolver::conflict_cache=debug', I managed to reproduce and minimize the slowdown. The main problem is that many versions of solana-crate-features have a dependency of cc = ^1.0.45 with the num_cpus feature, which is only available until cc v1.0.49 (see https://lib.rs/crates/cc/versions), so there is a lot of incompatibilities.

Here is a test showing this behavior:

#[test]
fn test() {
    let mut reg = Vec::new();

    const V: usize = 100;

    reg.push(pkg_dep_with(("cc", "1.0.0"), vec![], &[("f", &[])]));
    for v in 1..V {
        reg.push(pkg(("cc", format!("1.0.{v}"))));
    }

    reg.push(pkg_dep(
        ("solana-core", "1.0.0"),
        vec![
            dep("cc"), // resolve is instantaneous without this dependency
            dep_req("solana-1", "1.0.0"),
        ],
    ));

    const N: usize = 100;
    const C: usize = 3;

    for v in 0..N {
        let version = format!("1.0.{v}");
        for c in 1..C {
            reg.push(pkg_dep(
                (format!("solana-{c}"), &version),
                vec![dep_req(&format!("solana-{}", c + 1), &version)],
            ));
        }
        reg.push(pkg_dep(
            (format!("solana-{C}"), &version),
            vec![dep_req("cc", "1.0.0").with(&["f"])],
        ));
    }

    let deps = vec![dep("solana-core")];
    let res = resolve(deps, &reg).unwrap();
    println!("{:?}", res);
}

Here is the results with various values of C and N when V = 100, when running the test in release mode:
test-logs.tar.gz

With N = 99 the resolve is instantaneous, but with N = 100 it takes 12s, so there seems to be a threshold somewhere.

We can see that when C >= 2, the resolve time is linear over C but cubic over N:

	N=100	N=200	N=400
C=1	0.25s	0.5s	1.4s
C=2	7s	44s	314s
C=3	12s	77s	537s
C=6	30s	149s	1107s
C=12	99s	447s	2393s

When plugging these test cases in the SAT resolver, there is no slowdown at all: a solution is always found in less than 10ms.

Another thing: activate_deps_loop() is called in a loop in the resolver, and each time the conflict cache restart from zero since it is a local variable, so there is some duplicate work:
https://github.com/rust-lang/cargo/blob/aea3404eaea3f613a5f541abc1218b8eadd991ff/src/cargo/core/resolver/mod.rs#L140-L143.

x-hgg-x · 2024-10-08T16:43:29Z

When removing the cc dependency for solana-core, the resolve is instantaneous since it eliminates directly all versions of cc without the f feature.
When keeping this dependency, I think this is not possible because of feature unification.

x-hgg-x · 2024-10-08T16:54:45Z

(Note: I tried to run the sat resolver using infos from the [email protected] file, but the target platform for dependencies is missing in the file, so the generated summaries are invalid).

x-hgg-x · 2024-10-08T17:03:58Z

After more tests, it seems that the threshold occurs when N >= V.

So in the example above, setting V to 101 instead of 100 completely eliminates the slowdown, even if the number of candidate versions has increased.

Eh2406 · 2024-10-08T21:29:28Z

That is very intriguing results! I wonder if similar slowdowns happen with the PubGrub based resolver from this crate.

(Note: I tried to run the sat resolver using infos from the [email protected] file, but the target platform for dependencies is missing in the file, so the generated summaries are invalid).

It is really unfortunate that we have two different testing infrastructures one here and one in Cargo's repo. I wonder how we could bring them closer to being inter-compatible.

After more tests, it seems that the threshold occurs when N >= V.

There are various heuristics that use "smallest number of candidates" to decide which requirements to resolve first. I wonder if that's why this cut off occurs.

x-hgg-x · 2024-10-08T21:29:59Z

After a quick perf, 70% of the time is spent hashing in this structure:
https://github.com/rust-lang/cargo/blob/ac39e69a0fff26f0b8c5e886c42175e8b58a6403/src/cargo/core/resolver/context.rs#L18-L29

Eh2406 · 2024-10-08T21:31:37Z

I like to collect perf results with https://github.com/mstange/samply so that the results can be shared with just a few clicks.

x-hgg-x · 2024-10-08T21:54:25Z

Here is two perf profiles with C = 3 and N = 100 (read with https://profiler.firefox.com):
cargo.tar.gz
For the second profile I replaced the hasher with rustc-hash for the 4 fields in the ResolverContext of type im_rc::HashMap or im_rc::HashSet, and it shows a 60% decrease in resolution time.

Eh2406 · 2024-10-08T22:29:48Z

That seems like an easy win! I look forward to your a PR to Cargo.

x-hgg-x · 2024-10-08T22:59:34Z

That is very intriguing results! I wonder if similar slowdowns happen with the PubGrub based resolver from this crate.

For this test it takes less than 2s to resolve using the PubGrub resolver with C = 12 and N = 400.

But I observe the same threshold when N >= V, and the same trend when N increases, but all resolution times are much smaller than with the current cargo resolver.

Eh2406 · 2024-10-09T18:21:36Z

I don't know if this is helpful, but I just ran this project for the Solana crates. out_solana.zip here's a CSV of the output. There may be valuable insight by comparing consecutive versions of crates. For example:

I don't see a big jump in time/cargo_time comparing versions of solana-core just before 1.5.15 with just after. So the changes in ^ to = don't seem to have mattered.
solana-core 1.2.18 and 1.2.19 are the slowest for cargo (cargo_time of >250s). But 1.3.1 is half that time. And 1.2.9 is 43s.
solana-archiver-lib 1.1.12 is the slowest for pub (time of >96s). But 1.1.13 is 73s.

If you can push the rustc-hash branch somewhere I will happily do another data collection to see how big a difference that makes overall!

x-hgg-x · 2024-10-09T20:31:09Z

Here is my branch: https://github.com/x-hgg-x/cargo/tree/resolver-perf
I also tried to use nohash instead of rustc-hash, and it further improves resolve time by another 30% (using C = 3 and N = 100) :

commit	duration
master	16s
rustc-hash	6.7s
nohash	4.6s

Firefox perf profiles: perf.tar.gz

Eh2406 · 2024-10-09T22:37:35Z

Running all crates in the index went from 315.49min to 236.12min when going from whatever cargo commit I was on to your rustc-hash commit. That is a very good improvement! (I didn't compare to the nohash because this tool runs resolution in parallel, and your branch adds PackageId::new and a ACTIVATION_KEY_CACHE which will probably make things difficult to run in parallel.) Looking at the disaggregated data there are very few examples that got slower with this change. solana-core 1.2.18 and 1.2.19 are now in the 190s range! I look forward to the PR!

x-hgg-x · 2024-10-09T22:41:55Z

I don't see a big jump in time/cargo_time comparing versions of solana-core just before 1.5.15 with just after. So the changes in ^ to = don't seem to have mattered.

Versions between 1.14.15 and 1.18.14 didn't succeed because of yanked crates.

(Note: I needed to write a small Rust program to sort the csv by increasing name and version to better analyze it, so it should be done before writing the csv.)

x-hgg-x · 2024-10-09T22:51:28Z

(I didn't compare to the nohash because this tool runs resolution in parallel, and your branch adds PackageId::new and a ACTIVATION_KEY_CACHE which will probably make things difficult to run in parallel.)

PackageId::new already exists and already locks a Mutex. My commit added a second lock, but since the PackageId are only constructed once when loading a manifest there should be no visible slowdown even when running in parallel.

Improve resolver speed ### What does this PR try to resolve? This PR improves the resolver speed after investigations from Eh2406/pubgrub-crates-benchmark#6. ### How should we test and review this PR? Commit 1 adds a test showing a slow case in the resolver, where resolving time is cubic over the number of versions of a crate. It can be used for benchmarking the resolver. Comparison of the resolving time with various values of `N=LAST_CRATE_VERSION_COUNT` and `C=TRANSITIVE_CRATES_COUNT`: | | N=100 | N=200 | N=400 | |------|-------|-------|-------| | C=1 | 0.25s | 0.5s | 1.4s | | C=2 | 7s | 44s | 314s | | C=3 | 12s | 77s | 537s | | C=6 | 30s | 149s | 1107s | | C=12 | 99s | 447s | 2393s | Commit 2 replaces the default hasher with the hasher from `rustc-hash`, decreasing resolving time by more than 50%. Performance comparison with the test from commit 1 by setting `LAST_CRATE_VERSION_COUNT = 100`: | commit | duration | |-----------------|----------| | master | 16s | | with rustc-hash | 6.7s | Firefox profiles, can be read with https://profiler.firefox.com: [perf.tar.gz](https://github.com/user-attachments/files/17318243/perf.tar.gz) r? Eh2406

x-hgg-x · 2024-10-10T21:46:34Z

Another thing: activate_deps_loop() is called in a loop in the resolver, and each time the conflict cache restart from zero since it is a local variable, so there is some duplicate work:
https://github.com/rust-lang/cargo/blob/aea3404eaea3f613a5f541abc1218b8eadd991ff/src/cargo/core/resolver/mod.rs#L140-L143.

For solana-archiver-lib = "1.0.2", the conflict cache is cleaned in the middle of the resolving when the registry has pending index requests, so it does the same work 7 times.

This may explain the long time spent dropping the ResolverContext (rust-lang/cargo#14665 (comment)), and the not consistent benchmarks (rust-lang/cargo#14665 (comment)).

x-hgg-x · 2024-10-10T22:06:01Z

By moving the conflict cache out of the loop, the resolving time for solana-archiver-lib = "1.0.2" drops from 400s to 50s.

x-hgg-x · 2024-10-10T22:07:51Z

And the resolving time for solana-core = "=1.2.18" drops to 25s (!).

Eh2406 · 2024-10-10T22:09:24Z

The loop has to do with requesting data from the network. By processing all the existing data it determines what new data is required. Then it waits for that data to be available and repeats. This should all be deterministic.

Whether the conflict cash can be shared will require more thought than I can give it at the moment. It has to do with whether new facts could ever invalidate a conflict that sorry been registered. (I think you're right that it can be shared, but I'm not confident.)

I don't see how it could affect the long dropping time for ResolverContext can you elaborate? I also don't see how it could cause nondeterminism?

And it definitely is not related to the slow times were observing when running this tool, because this tool peloads all of the data and does not do that loop.

x-hgg-x · 2024-10-10T23:14:01Z

I don't see how it could affect the long dropping time for ResolverContext can you elaborate?

The ResolverContext contains a lot of data so the drop takes several seconds. But this value is multiplied by the 7 times we rebuild and drop the ResolverContext.

I also don't see how it could cause nondeterminism?

Here is two consecutive runs for solana-core = "=1.0.5" on master. I also printed the instant when the loop is restarted:

Run 1 (236s):

    Updating crates.io index
110.065434ms
113.120678ms
767.306032ms [===========>                     ] 1 complete; 1 pending                                       
769.198182ms
1.479256828s [=================>               ] 32 complete; 26 pending                                     
5.338349626s [====================>            ] 207 complete; 61 pending                                    
   Resolving dependency graph...
10.957888526s[=======================>         ] 407 complete; 16 pending                                    
   Resolving dependency graph...
48.325845407s[==========================>      ] 493 complete; 5 pending                                     
   Resolving dependency graph...
86.382807965s[=============================>   ] 526 complete; 2 pending                                     
   Resolving dependency graph...
124.414130285s=============================>   ] 528 complete; 5 pending                                     
   Resolving dependency graph...
162.600263637s=============================>   ] 533 complete; 1 pending                                     
   Resolving dependency graph...
200.951554361s=============================>   ] 534 complete; 0 pending                                     
   Resolving dependency graph...
     Locking 513 packages to latest compatible versions

Run 2 (310s):

    Updating crates.io index
108.795614ms
111.606964ms
760.045514ms [===========>                     ] 1 complete; 1 pending                                                                                                                                                 
763.731469ms
1.451621731s [=================>               ] 48 complete; 10 pending                                                                                                                                               
4.560725379s [====================>            ] 237 complete; 20 pending                                                                                                                                              
   Resolving dependency graph...
9.873827135s [=======================>         ] 412 complete; 8 pending                                                                                                                                               
   Resolving dependency graph...
41.735024564s[==========================>      ] 488 complete; 9 pending                                                                                                                                                
   Resolving dependency graph...
79.806753686s[=============================>   ] 518 complete; 5 pending                                                                                                                                                
   Resolving dependency graph...
118.547341727s=============================>   ] 523 complete; 9 pending                                                                                                                                                
   Resolving dependency graph...
157.759883082s=============================>   ] 532 complete; 1 pending                                                                                                                                                
   Resolving dependency graph...
196.723956203s=============================>   ] 533 complete; 0 pending                                                                                                                                                
   Resolving dependency graph...
235.680510433s=============================>   ] 533 complete; 1 pending                                                                                                                                                
   Resolving dependency graph...
274.420162854s=============================>   ] 534 complete; 0 pending                                                                                                                                                
   Resolving dependency graph...
     Locking 513 packages to latest compatible versions

And it definitely is not related to the slow times were observing when running this tool, because this tool peloads all of the data and does not do that loop.

Yes, so I should use this repo to do my benchmarks.

Eh2406 · 2024-10-11T02:09:16Z

But this value is multiplied by the 7 times we rebuild and drop the ResolverContext.

Ah, yes that makes sense. Thanks.

The ResolverContext contains a lot of data so the drop takes several seconds.

One thing we observed in this repo is that using tikv_jemallocator or mimalloc made a big difference for multithreaded performance. I think I also measured that it helped with single threaded performance. We should double check what allocator cargo is using, and what allocator rustc is using, and if they're not the same switch cargo to the approved rustc one to see if it helps. That may make a big difference to users at very little effort.

Here is two consecutive runs for solana-core = "=1.0.5" on master. I also printed the instant when the loop is restarted:

There's definitely non-determinism. When looking at individual packages, I've been observing it too. I'm not convinced that the restarts are the cause of the non-determinism, or if there's another root cause we should be looking for.

Yes, so I should use this repo to do my benchmarks.

I didn't mean to go that far. On one hand, this repo is set up to make it ~~convenient~~ possible to do reasonably reproducible benchmarks over a representative set of all packages. But by testing cargo directly, you managed to find some significant speedups that are directly applicable to users today. Both approaches have their value and uses. I'm happy to support you on whatever effort brings you joy. I'm just trying to keep track of which issues were making progress on.

x-hgg-x · 2024-10-11T16:43:09Z

We should double check what allocator cargo is using, and what allocator rustc is using, and if they're not the same switch cargo to the approved rustc one to see if it helps.

There is an option in rustc to enable jemalloc, and I think it is enabled for the final binary:
https://github.com/rust-lang/rust/blob/484c8e78cb915bf1aca57ec62953324fe76b5d37/config.example.toml#L723-L725
https://github.com/search?q=repo%3Arust-lang%2Frust%20rust.jemalloc&type=code

I'm not convinced that the restarts are the cause of the non-determinism, or if there's another root cause we should be looking for.

There is at least a difference between a fully offline registry where everything is always ready and the online registry which needs to download from index.crates.io. For solana-core = "=1.2.18", resolving with an offline registry takes 200s, but only 80s in the last iteration of the loop for an online registry. It seems that much less candidates are considered, since pending queries cannot add new candidates for the resolver. So the path taken by the resolver can vary depending of which query is ready at a given time, and it also means that the conflict cache cannot be reused. If this is the case, it may be possible to find a case where resolving with an online registry will error, but not when using an offline registry.

I didn't mean to go that far. On one hand, this repo is set up to make it convenient possible to do reasonably reproducible benchmarks over a representative set of all packages. But by testing cargo directly, you managed to find some significant speedups that are directly applicable to users today. Both approaches have their value and uses. I'm happy to support you on whatever effort brings you joy. I'm just trying to keep track of which issues were making progress on.

It's all good, I also think that both approaches have their own advantages.

x-hgg-x · 2024-10-11T18:16:40Z

I have found the non-determinist bug: this code is incorrect:
https://github.com/rust-lang/cargo/blob/e8adba9d24257e14fef8793cecabb55b2fbcd0e3/src/cargo/sources/registry/http_remote.rs#L798-L811.

self.handle_completed_downloads() needs to be called after self.multi.perform(), or else we can return from the function when all downloads are finished but not processed.

We can see it in the above log:

235.680510433s=============================>   ] 533 complete; 1 pending                                                                                                                                                
   Resolving dependency graph...
274.420162854s=============================>   ] 534 complete; 0 pending                                                                                                                                                
   Resolving dependency graph...

Eh2406 · 2024-10-11T18:25:16Z

There is an option in rustc to enable jemalloc, and I think it is enabled for the final binary:

So that's what used for the compiler and LLVM. The next question is what used for cargo? Which appears to be the system default allocator https://github.com/search?q=repo%3Arust-lang%2Fcargo%20global_allocator&type=code We should try jemalloc and see if it helps!

resolving with an offline registry takes 200s, but only 80s in the last iteration of the loop for an online registry.

That's very intriguing. There are several layers of caching that might explain this difference. The most direct is the reuse of the RegistryQueryer, but also registries themselves cash reading from disk. I don't know if that's enough to explain a 2x difference.

It seems that much less candidates are considered, since pending queries cannot add new candidates for the resolver. So the path taken by the resolver can vary depending of which query is ready at a given time, and it also means that the conflict cache cannot be reused. If this is the case, it may be possible to find a case where resolving with an online registry will error, but not when using an offline registry.

In my understanding, the final loop of resolution and the only loop when off-line should be identical. Turning on logging and comparing the output between the two different scenarios should make this clear. ( Or adding debug statements that print out the precisely relevant info if the logs are too verbose. ) If they're not the same, I would really love to understand why!

By definition, it's the final loop because nothing returned pending. Therefore it's getting the same input as the off-line mode where everything returns ready on the first request. By not sharing anything that could affect the order of operations in the resolver between calls to the resolver we ensure that they get the exact same output.

But the more I talk about it, the more I'm convinced that you are right about being able to share the conflict cash. The code is set up so that if an incremental resolution fails then we know that waiting for the network will not cause it to succeed, we can report it to the user as soon as we find the error. Which is incompatible with the conflict cash storing a problem that would be invalidated by future network data.

x-hgg-x · 2024-10-11T18:45:36Z

Turning on logging and comparing the output between the two different scenarios should make this clear. ( Or adding debug statements that print out the precisely relevant info if the logs are too verbose. )

Here is the conflict cache logs for solana-core = "=1.2.18":
logs.tar.gz
Files iter1.log to iter8.log are logs for a resolve using an online registry (each file representing a loop iteration), and full.log is the log using the repo for the offline registry.

We can see that iteration 4 to 8 are the same using the online registry, and the conflict cache using the offline registry is much bigger than for the iteration 8.

Here is the conflict cache log when sharing the conflict cache between iterations:
shared.log

Eh2406 · 2024-10-12T01:30:47Z

That is so interesting! I wonder why the last loop didn't match the full run. It's always so intriguing to discover one's knowledge is incomplete. Thank you! I wonder where the two runs first diverge? Not just for the conflict clauses they had, but for every decision they make.

The fact that for 4 iterations we generate the same conflicts, definitely backs up your measured evidence of huge performance savings by sharing it. Add that to your list of future PR's!

Due to the holiday, I will probably have limited availability to converse on the subject, despite its deep interest to me, until Monday.

x-hgg-x · 2024-10-12T22:00:21Z

I wonder where the two runs first diverge? Not just for the conflict clauses they had, but for every decision they make.

After analysis the difference is that adding solana-core = "=1.2.18" in a Cargo.toml doesn't take dev dependencies into account. After directly copying the Cargo.toml of the dependency, I observe exactly the same behavior between an offline registry and the last iteration for an online registry.

x-hgg-x · 2024-10-12T23:36:07Z

When using an online registry, deps which are not yet fetched are not added to the candidate summary:
https://github.com/rust-lang/cargo/blob/82c489f1c612096c2dffc48a4091d21af4b8e046/src/cargo/core/resolver/dep_cache.rs#L248-L266.

On the other hand, if a dependency query returns Poll::Ready, then all compatible summaries are returned, so we cannot have a partial view where only some compatible summaries would be returned.

This means we cannot add a conflict in the cache which would be incompatible with future fetches, and therefore the conflict cache should be able to be shared.

Comparing to an offline registry or the last iteration for an online registry, doing the resolve with a shared conflict cache and an online registry makes the order of operations slightly differs: the content of the cache is different but the resolve result after the last iteration is the same.

Here is the sorted conflict cache logs for when sharing it: shared_sorted.log
For comparison, the same logs for an offline registry: full_sorted.log

x-hgg-x · 2024-10-13T09:41:24Z

We should try jemalloc and see if it helps!

Here is the perf profiles for solana-core = "=1.0.5" without dev deps, where the baseline is using the nohash optimization and the fix from rust-lang/cargo#14680: profiles.tar.gz

4 profiles are compared:

baseline
switch to jemalloc (see branch https://github.com/x-hgg-x/cargo/commits/resolver-perf-3)
optimized rcvec: RcVecIter struct is optimized so that we don't need to clone the inner T when iterating (see branch https://github.com/x-hgg-x/cargo/commits/resolver-perf-4)
optimized rcvec + switch to jemalloc

profile	duration
baseline	213s
rcvec	202s
jemalloc	193s
rcvec+jemalloc	189s

Output of perf diff between profile 1 and 4:

# Baseline  Delta Abs  Shared Object  Symbol                                                                                                                                                                                                                                                                                                                                           
     8.74%     -6.85%  cargo          [.] cargo::core::resolver::activate_deps_loop
     3.85%     +4.24%  cargo          [.] <cargo::core::dependency::Dependency as core::hash::Hash>::hash
    11.95%     -4.21%  cargo          [.] cargo::core::resolver::RemainingCandidates::next
     2.98%     -2.98%  libc.so.6      [.] _int_malloc
     2.03%     -2.03%  libc.so.6      [.] _int_free
     2.36%     +1.93%  cargo          [.] core::hash::Hash::hash_slice
     5.78%     +1.69%  libc.so.6      [.] __memmove_avx_unaligned_erms
     6.36%     +1.39%  cargo          [.] <bitmaps::bitmap::Iter<Size> as core::iter::traits::iterator::Iterator>::next
     2.75%     +1.37%  cargo          [.] alloc::rc::Rc<T,A>::make_mut
     1.50%     +1.09%  cargo          [.] cargo::core::resolver::activate
     1.30%     +1.06%  cargo          [.] cargo::core::resolver::conflict_cache::ConflictCache::conflicting
     3.24%     +1.05%  cargo          [.] <alloc::rc::Rc<T,A> as core::ops::drop::Drop>::drop
               +0.79%  cargo          [.] _rjem_sdallocx
               +0.66%  cargo          [.] tikv_jemallocator::layout_to_flags
     3.12%     +0.51%  cargo          [.] <core::hash::sip::Hasher<S> as core::hash::Hasher>::write

We see that the next thing to optimize will be the hashing of Dependency.

Eh2406 · 2024-10-14T20:03:56Z

This means we cannot add a conflict in the cache which would be incompatible with future fetches, and therefore the conflict cache should be able to be shared.

I am convinced. Looking forward to seeing a PR for this.

Looks like jemalloc and rcvec are solid wins. More PR's for you! Although I would do them separately, because the review concerns of adding a new dependency are different from the review concerns of modifying an internal data structure.

fix(resolver): share conflict cache between activation retries ### What does this PR try to resolve? Found in Eh2406/pubgrub-crates-benchmark#6. Currently the resolve is done in a loop, restarting if there are pending dependencies which are not yet fetched, and recreating the conflict cache at each iteration. This means we do a lot of duplicate work, especially if the newly fetched dependencies has zero or few dependencies which doesn't conflict. This also means that the maximum depth of the dependency graph has a big impact on the total resolving time, since adding a dependency not yet seen means we will restart the resolve at the next iteration. Here is the output of the resolve for the master branch using `solana-core = "=1.2.18"` in `Cargo.toml`, printing the duration from the start after each iteration: ``` Updating crates.io index 103.142341ms [=====> ] 1 complete; 0 pending 118.486144ms [========> ] 2 complete; 0 pending 785.389055ms [===========> ] 62 complete; 0 pending 4.916033377s [==============> ] 277 complete; 0 pending 9.13101404s [=================> ] 439 complete; 0 pending 50.083251549s [====================> ] 520 complete; 0 pending 133.401303265s [=======================> ] 561 complete; 0 pending 214.120483345s [==========================> ] 565 complete; 0 pending 294.180677785s [=============================> ] 566 complete; 0 pending Locking 536 packages to latest compatible versions ``` To improve the situation, this PR moves the conflict cache outside the activation loop so it can be reused for all iterations. This is possible since when using an online registry, dependencies which are not yet fetched are not added to the candidate summary: https://github.com/rust-lang/cargo/blob/82c489f1c612096c2dffc48a4091d21af4b8e046/src/cargo/core/resolver/dep_cache.rs#L248-L266 On the other hand, if a dependency query returns `Poll::Ready`, then all compatible summaries are returned, so we cannot have a partial view where only some compatible summaries would be returned. This means we cannot add a conflict in the cache which would be incompatible with future fetches, and therefore the conflict cache can be shared. Here is the output of the resolve with this PR: ``` Updating crates.io index 98.239126ms [=====> ] 1 complete; 0 pending 127.528982ms [========> ] 2 complete; 0 pending 821.601257ms [===========> ] 62 complete; 0 pending 4.67917132s [==============> ] 277 complete; 0 pending 5.933230172s [=================> ] 431 complete; 0 pending 14.74321904s [====================> ] 508 complete; 0 pending 24.607428893s [=======================> ] 546 complete; 0 pending 24.700610469s [==========================> ] 550 complete; 0 pending 24.757651875s [=============================> ] 551 complete; 0 pending Locking 536 packages to latest compatible versions ``` Besides the huge performance savings between iterations, sharing the conflict cache has the side effect of eliminating dead paths earlier in the resolve after a restart. We can see that the duration of all iterations using this PR is less than the duration of the last iteration on master, and that it needs to resolve less dependencies overall.

fix(resolver): avoid cloning when iterating using RcVecIter ### What does this PR try to resolve? Follow up of Eh2406/pubgrub-crates-benchmark#6 (comment). This PR modifies the internal `RcVecIter` struct so that we can iterate on its items without cloning them. This improves performance of the resolver because eagerly cloning `Arc<T>` is not free. Comparison of performance using `solana-core = "=1.0.5"` in `Cargo.toml`: | branch | duration | |------------|----------| | master | 213s | | PR | 202s | r? Eh2406

Eh2406 · 2024-10-15T22:16:41Z

Thank you for all the work you've done so far. It's made a dramatic improvement to the real-world performance for users of cargo.

The Jewish holy days come altogether this time of year. I will be unable to respond until Monday. I promise, it's almost over just this week and next week, and then I'll be back to real availability. In addition to my lack of availability, Eric and Ed also have limited availability for a while. (Vacation and a conference in China respectively.)

You seem to be coming up with ways to improve things on your own, I'm excited to see what else you come up with. If you're looking for things to do, the next step on the SAT tests would be adding features to PropTest. Alternatively, you could investigate the bug you made such a clear report for. If you want to stay focused on performance, I had a thought to reduce the drop time of ResolverContext.

ResolverContext is a copy of the critical data structures in case we need to jump back to the state to try a different decision. In general making data structures that are cheap to reset is hard, which is why we rely on im-rs. But two of these structures are "append only" and keep track of when they were modified. Specifically activations keeps track by storing a conflict age for each key, and links implicitly keeps track by whether the package is active (although it could also keep track explicitly by storing an age). So I think, for these two we don't need a stack of copies we can "just" remove the new entries when were back tracking. (Although I'd want to be really careful testing it. I'd want a commit where we add the new data structure without removing the old data structure and assert that they are equal every time either one is modified.) Straightforwardly we could use .retain to only keep the ones that are still active, in O(#active). If we switch to a order preserving map (indexmap we should be able to pop the most recent items for a O(distance backtracked). It should work in theory, but I haven't gotten at work in practice yet.

In this issue before we have also had some other leads. Which you could follow-up on if you wanted to. Like trying to create a minimal example for why some Solana crates take >20 sec on the PubGrub resolver. Or trying to improve interoperability between the testing infrastructure here and in the Cargo repo.

Anyway just some ideas and encouragement.

x-hgg-x · 2024-10-22T13:55:06Z

I experimented with the pubgrub crate to find why it is slow, and after finding the reason (a lot of allocations in the hoot loop), I tried to fix it in my fork: https://github.com/x-hgg-x/pubgrub.

I have done two main improvements:

Use u32 for package type, so that we don't clone it everywhere and its hash doesn't need memory access.
Limit the number of versions of a package to 63, so that we can use u8 for version type and u64 for version set and term types, using it as a bitset. A dependency provider will need to get all possible version of package, then sort them and compute the version as the sorted index. To support more than 63 versions we can use virtual packages to limit the resolver to only one true version per package (see x-hgg-x/pubgrub@691dece).

By doing all this, there is almost no remaining allocations in the resolver hot loop, so I expect the resolving time to be divided by 5 to 10.

Flamegraph before (total time 30s): flamegraph.1.svg.tar.gz

Flamegraph when using u32 as package type (total time 15s): flamegraph.2.svg.tar.gz

Flamegraph when using u64 as version set: TODO

x-hgg-x · 2024-10-22T14:05:53Z

During this process I also found how to implement weak dependencies in pubgrub.

We need to lower this clause: feature => (dep => dep_feature).

For package a with feature weak = ["b?/c"], we can set these packages and dependencies:

a#1.x.y+feat(weak) @ 1.2.3  =>  a#1.x.y+feat(weak)+enabled @ (off | 1.2.3)
                                                              ^ "off" is preferred to other versions when choosing versions
a#1.x.y+feat(weak)+enabled @ off  =>  a#1.x.y+dep(b) @ off
                                                       ^ "off" prevents the dep "b" to be activated (common to all versions)
a#1.x.y+feat(weak)+enabled @ 1.2.3  =>  a#1.x.y+dep(b)+feat(c) @ 1.2.3

Eh2406 · 2024-10-22T17:25:40Z

Thank you for spiking those out! I'm still hoping to get to an understanding of why these versions are algorithmically so much slower than similar nearby versions. In the meantime I'm happy to see any kind of speed up the comes along!

I did not have a chance to fully understand your fork. But the first suggestion sounds similar to pubgrub-rs/pubgrub#242. Similarly, at one time I had a branch that did thread local interning in this project. Given that your seeing a 50% improvement, I should really rebased both of those efforts.

Using a bitset as a VS is a pretty interesting idea. It does mean listing all versions of the dependent before the parent package can be processed. Fundamentally, if A depends on B@">2.0" we don't know how to express ">2.0" until we listed the versions of B which means we need to wait for that network request to return the dependencies of A. But that is a problem we can figure out later, if the performance numbers are good enough.

x-hgg-x · 2024-10-23T07:35:36Z

It does mean listing all versions of the dependent before the parent package can be processed.

Yes, but processing the requirement once par package per call to resolve() in the get_dependencies() method should be better than reprocessing it at each loop iteration in the prioritize() method for all packages and in the choose_version() method for the activated package.

x-hgg-x · 2024-10-24T18:12:45Z

These two lines seem incorrect:
https://github.com/pubgrub-rs/pubgrub/blob/be526674f1039309be04da191abd822a7d00bfeb/src/term.rs#L113
https://github.com/pubgrub-rs/pubgrub/blob/be526674f1039309be04da191abd822a7d00bfeb/src/term.rs#L141

They cause the following tests to fail:

#[test]
fn test_is_disjoint() {
    let p1 = Term::Positive(Range::<u32>::higher_than(20u32));
    let n1 = Term::Negative(p1.unwrap_positive().complement());

    let p2 = Term::Positive(Range::<u32>::lower_than(10u32));
    let n2 = Term::Negative(p2.unwrap_positive().complement());

    assert!(p1.is_disjoint(&p2)); // true
    assert!(p1.is_disjoint(&n2)); // true
    assert!(n1.is_disjoint(&p2)); // true
    assert!(n1.is_disjoint(&n2)); // false
}

#[test]
fn test_subset_of() {
    let p1 = Term::Positive(Range::<u32>::higher_than(20u32));
    let n1 = Term::Negative(p1.unwrap_positive().complement());

    let p2 = Term::Positive(Range::<u32>::higher_than(10u32));
    let n2 = Term::Negative(p2.unwrap_positive().complement());

    assert!(p1.subset_of(&p2)); // true
    assert!(p1.subset_of(&n2)); // true
    assert!(n1.subset_of(&p2)); // false
    assert!(n1.subset_of(&n2)); // true
}

As the docs, Positive(r) and Negative(r.complement()) should be equivalent:
https://github.com/pubgrub-rs/pubgrub/blob/be526674f1039309be04da191abd822a7d00bfeb/src/term.rs#L12

But there is also a special handling when the term is positive: https://github.com/pubgrub-rs/pubgrub/blob/be526674f1039309be04da191abd822a7d00bfeb/src/internal/partial_solution.rs#L246

Can Positive(r) and Negative(r.complement()) be used interchangeably? It could be useful if we need to add custom incompatibilities with positive and negative terms (I want to try extending the api of get_dependencies() to add a OneOf variant, which could replace the Wide package variant for dependencies covering multiple major versions).

Eh2406 · 2024-10-28T16:13:30Z

Sorry for the slow reply. The holidays are over so hopefully I can be more prompt going forward.

As the docs, Positive(r) and Negative(r.complement()) should be equivalent:
https://github.com/pubgrub-rs/pubgrub/blob/be526674f1039309be04da191abd822a7d00bfeb/src/term.rs#L12

Everyone has been confused by this when learning the PubGrub code, myself included. The rest of the paragraph you linked to tries to explain when Positive(r) and Negative(r.complement()) are not equivalent. That documentation was added when konstin, was confused by the issue here. I'm always happy to add more documentation, or restructure the code, to make the distinction clearer.

x-hgg-x · 2024-10-28T18:03:34Z

I made a PR to improve documentation: pubgrub-rs/pubgrub#264.

x-hgg-x · 2024-11-01T15:03:08Z

I implemented multiple optimizations in pubgrub in https://github.com/x-hgg-x/pubgrub/tree/perf (including the two above optimizations), and I applied it for cargo resolution in a new repo: https://github.com/x-hgg-x/pubgrub-bench.

Resolution time for solana-archiver-lib v1.1.12:

Base resolution time is around 90s using your repo.
When using u32 for package type and u64 for version set and terms, the resolution time drops to around 3s: perf1.tar.gz
By optimizing processing of incompatibilities contradictions (x-hgg-x/pubgrub@9490907), the resolution time drops again to around 1.6s: perf2.tar.gz

Sorted list of packages in the solution: solution.txt.tar.gz
Pubgrub log: pubgrub.log.tar.gz

Eh2406 mentioned this issue Aug 22, 2024

perf: remove more redundant scanning pubgrub-rs/pubgrub#251

Merged

x-hgg-x mentioned this issue Oct 10, 2024

Improve resolver speed rust-lang/cargo#14663

Merged

x-hgg-x mentioned this issue Oct 12, 2024

fix(registry): make sure to process a download after it has completed rust-lang/cargo#14680

Closed

This was referenced Oct 15, 2024

fix(resolver): avoid cloning when iterating using RcVecIter rust-lang/cargo#14690

Merged

fix(perf): switch global allocator to mimalloc rust-lang/cargo#14691

Open

fix(resolver): share conflict cache between activation retries rust-lang/cargo#14692

Merged

Investigate why Solana is so slow. #6

Investigate why Solana is so slow. #6

Comments

Eh2406 commented Aug 14, 2024

Eh2406 commented Aug 19, 2024 • edited Loading

Eh2406 commented Aug 21, 2024 • edited Loading

x-hgg-x commented Oct 7, 2024 • edited Loading

Eh2406 commented Oct 7, 2024

x-hgg-x commented Oct 8, 2024 • edited Loading

x-hgg-x commented Oct 8, 2024

x-hgg-x commented Oct 8, 2024 • edited Loading

x-hgg-x commented Oct 8, 2024 • edited Loading

Eh2406 commented Oct 8, 2024

x-hgg-x commented Oct 8, 2024

Eh2406 commented Oct 8, 2024

x-hgg-x commented Oct 8, 2024 • edited Loading

Eh2406 commented Oct 8, 2024

x-hgg-x commented Oct 8, 2024 • edited Loading

Eh2406 commented Oct 9, 2024

x-hgg-x commented Oct 9, 2024 • edited Loading

Eh2406 commented Oct 9, 2024

x-hgg-x commented Oct 9, 2024 • edited Loading

x-hgg-x commented Oct 9, 2024

x-hgg-x commented Oct 10, 2024 • edited Loading

x-hgg-x commented Oct 10, 2024

x-hgg-x commented Oct 10, 2024 • edited Loading

Eh2406 commented Oct 10, 2024 • edited Loading

x-hgg-x commented Oct 10, 2024 • edited Loading

Eh2406 commented Oct 11, 2024

x-hgg-x commented Oct 11, 2024 • edited Loading

x-hgg-x commented Oct 11, 2024 • edited Loading

Eh2406 commented Oct 11, 2024

x-hgg-x commented Oct 11, 2024 • edited Loading

Eh2406 commented Oct 12, 2024

x-hgg-x commented Oct 12, 2024 • edited Loading

x-hgg-x commented Oct 12, 2024 • edited Loading

x-hgg-x commented Oct 13, 2024 • edited Loading

Eh2406 commented Oct 14, 2024

Eh2406 commented Oct 15, 2024

x-hgg-x commented Oct 22, 2024 • edited Loading

x-hgg-x commented Oct 22, 2024 • edited Loading

Eh2406 commented Oct 22, 2024

x-hgg-x commented Oct 23, 2024 • edited Loading

x-hgg-x commented Oct 24, 2024 • edited Loading

Eh2406 commented Oct 28, 2024

x-hgg-x commented Oct 28, 2024

x-hgg-x commented Nov 1, 2024 • edited Loading

Eh2406 commented Aug 19, 2024 •

edited

Loading

Eh2406 commented Aug 21, 2024 •

edited

Loading

x-hgg-x commented Oct 7, 2024 •

edited

Loading

x-hgg-x commented Oct 8, 2024 •

edited

Loading

x-hgg-x commented Oct 8, 2024 •

edited

Loading

x-hgg-x commented Oct 8, 2024 •

edited

Loading

x-hgg-x commented Oct 8, 2024 •

edited

Loading

x-hgg-x commented Oct 8, 2024 •

edited

Loading

x-hgg-x commented Oct 9, 2024 •

edited

Loading

x-hgg-x commented Oct 9, 2024 •

edited

Loading

x-hgg-x commented Oct 10, 2024 •

edited

Loading

x-hgg-x commented Oct 10, 2024 •

edited

Loading

Eh2406 commented Oct 10, 2024 •

edited

Loading

x-hgg-x commented Oct 10, 2024 •

edited

Loading

x-hgg-x commented Oct 11, 2024 •

edited

Loading

x-hgg-x commented Oct 11, 2024 •

edited

Loading

x-hgg-x commented Oct 11, 2024 •

edited

Loading

x-hgg-x commented Oct 12, 2024 •

edited

Loading

x-hgg-x commented Oct 12, 2024 •

edited

Loading

x-hgg-x commented Oct 13, 2024 •

edited

Loading

x-hgg-x commented Oct 22, 2024 •

edited

Loading

x-hgg-x commented Oct 22, 2024 •

edited

Loading

x-hgg-x commented Oct 23, 2024 •

edited

Loading

x-hgg-x commented Oct 24, 2024 •

edited

Loading

x-hgg-x commented Nov 1, 2024 •

edited

Loading