Optimize a slew of Cargo internals #4118
Conversation
r? @brson (rust_highfive has picked a reviewer for you, use r? to override)
r? @matklad
Quick overview; feel free to ignore most of what I said here. Just some thoughts.
src/cargo/core/resolver/mod.rs
Outdated
enum RcList<T> {
    Data(Rc<(T, RcList<T>)>),
    Empty,
}
I believe this is equivalent to `LinkedList<Rc<T>>`, right? Or is there some advantage to this?
Ah, not quite: the intention here is that `Clone` for `RcList<T>` is O(1), whereas `Clone` for `LinkedList<T>` is O(n).
I guess `Rc<LinkedList<T>>` would get us what we wanted. Though that's kind of questionable at best...
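For reference, the property being discussed looks roughly like this (a sketch of the idea, not the exact code from the PR):

```rust
use std::rc::Rc;

// Persistent cons list: cloning a handle only bumps one reference count, so
// `Clone` is O(1). `LinkedList<T>::clone` (and `LinkedList<Rc<T>>::clone`)
// walks and reallocates every node, which is O(n).
enum RcList<T> {
    Data(Rc<(T, RcList<T>)>),
    Empty,
}

impl<T> Clone for RcList<T> {
    fn clone(&self) -> RcList<T> {
        match *self {
            RcList::Data(ref rc) => RcList::Data(rc.clone()),
            RcList::Empty => RcList::Empty,
        }
    }
}

impl<T> RcList<T> {
    // Pushing shares the existing tail instead of copying it.
    fn push(&self, value: T) -> RcList<T> {
        RcList::Data(Rc::new((value, self.clone())))
    }
}
```

The trade-off is that nodes are only reachable front-to-back and are dropped recursively, which comes up again below.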
src/cargo/core/resolver/mod.rs
Outdated
@@ -369,21 +403,32 @@ impl<T> RcVecIter<T> {
    }
}

impl<T> Clone for RcVecIter<T> {
Maybe add a comment here that we're not deriving this because we want to avoid `T: Clone`?
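For instance, the comment plus impl could look something like this (the field names here are assumptions, not the PR's exact definition):

```rust
use std::ops::Range;
use std::rc::Rc;

struct RcVecIter<T> {
    vec: Rc<Vec<T>>,
    rest: Range<usize>,
}

// Not derived: `#[derive(Clone)]` would require `T: Clone`, but cloning this
// iterator only copies the `Rc` handle and the index range.
impl<T> Clone for RcVecIter<T> {
    fn clone(&self) -> RcVecIter<T> {
        RcVecIter {
            vec: self.vec.clone(),
            rest: self.rest.clone(),
        }
    }
}
```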
@@ -369,21 +403,32 @@ impl<T> RcVecIter<T> {
}
If we're refactoring anyway, the only thing this is used for is to get the equivalent of `first()` on the underlying vec. Maybe just use that in the first place?
src/cargo/core/resolver/mod.rs
Outdated
impl<T> Iterator for RcVecIter<T> where T: Clone {
    type Item = (usize, T);
As far as I can tell, the only reason we return `(usize, T)` here is because `activate_deps_loop` uses it for debugging (trace-level). Is it worth keeping? It's arguably equivalent to just storing `Enumerate<RcVecIter<T>>`.
Ah, unfortunately `Enumerate` doesn't allow access to its contents though :(
Also, historically that `cur` has been critical to debugging resolution problems, so I would want to keep it if we could!
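For illustration, keeping the index inside the iterator itself is what makes that `cur` value available for tracing, something an `Enumerate` wrapper wouldn't expose (field names are again assumed):

```rust
use std::ops::Range;
use std::rc::Rc;

struct RcVecIter<T> {
    vec: Rc<Vec<T>>,
    rest: Range<usize>,
}

impl<T: Clone> Iterator for RcVecIter<T> {
    type Item = (usize, T);

    fn next(&mut self) -> Option<(usize, T)> {
        // Yield the position alongside the item so callers (e.g. trace-level
        // logging in the resolver) can report which candidate this was.
        self.rest
            .next()
            .and_then(|i| self.vec.get(i).map(|item| (i, item.clone())))
    }
}
```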
prev_active.iter().any(|a| *a == b.summary) ||
    prev_active.iter().all(|a| {
        !compatible(a.version(), b.summary.version())
    })
So this is effectively worst-case looping through `prev_active` twice; we could avoid that with a relatively simple `for` loop. This seems like a clear win to me.
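For concreteness, the single-pass version could look something like this (a sketch; `Summary`, `compatible`, and the fields are stand-ins for the real resolver types):

```rust
#[derive(PartialEq)]
struct Summary {
    version: u32, // stand-in for a semver version
}

fn compatible(a: u32, b: u32) -> bool {
    a == b // stand-in for the real semver-compatibility check
}

// Equivalent to
//   prev_active.iter().any(|a| a == candidate) ||
//   prev_active.iter().all(|a| !compatible(a.version, candidate.version))
// but walks `prev_active` at most once.
fn conflicts(prev_active: &[Summary], candidate: &Summary) -> bool {
    let mut any_compatible = false;
    for a in prev_active {
        if a == candidate {
            return true; // already activated
        }
        if compatible(a.version, candidate.version) {
            any_compatible = true;
        }
    }
    // `all` over an empty iterator is true, so this is also vacuously true
    // when `prev_active` is empty, matching the original expression.
    !any_compatible
}
```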
Ah yeah, this loop actually doesn't matter too much; 100% of the slowdown so far has stemmed from too much allocation, nothing's been "you should actually run less code" so far!
src/cargo/core/dependency.rs
Outdated
    self
}

pub fn set_kind(&mut self, kind: Kind) -> &mut Dependency {
I don't really like that these two or three methods are duplicated between `Dependency` and `DependencyInner`. It looks like `DependencyInner` plays the dual role of a dependency builder (it exposes more setters than `Dependency`) and an `Rc`-wrapped data store. Also, there are a bunch of `pub` getters on `DependencyInner` which can be private:
fn version_req(&self) -> &VersionReq { &self.req }
fn name(&self) -> &str { &self.name }
fn source_id(&self) -> &SourceId { &self.source_id }
fn kind(&self) -> Kind { self.kind }
fn specified_req(&self) -> bool { self.specified_req }
fn platform(&self) -> Option<&Platform> {
    self.platform.as_ref()
}
Ouch, and `lock_to` on `DependencyInner` can be private as well!
Could we somehow improve the `Dependency`/`DependencyInner` separation? Perhaps we can make a private `DependencyData` without any getters as the thing inside the `Rc` inside `Dependency`, and a public `DependencyBuilder` with all public fields and no getters, so that one can use the sweet `{ ..default }` syntax to construct deps without setter boilerplate.
Conflating refactorings and perf optimizations is not necessarily a good idea, but perhaps we can make things nicer in a simple way by just moving all those "can be private" things from `DependencyInner` directly into `Dependency`, just to offset the duplication in `set_version` and `set_kind`?
Since the discovery of `Rc::make_mut` I actually think that `DependencyInner` is no longer needed at all. The perf should be the same if you just use `&mut self` methods on `Dependency`, as they'd all hit the "I'm an owned `Rc`" fast path anyway.
I don't mind doing that as part of this!
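A minimal sketch of that `Rc::make_mut` approach (type and field names are illustrative, not the code that eventually landed):

```rust
use std::rc::Rc;

#[derive(Clone, Copy)]
enum Kind {
    Normal,
    Development,
    Build,
}

// Illustrative inner data; the real struct carries many more fields.
#[derive(Clone)]
struct DependencyData {
    kind: Kind,
}

#[derive(Clone)]
struct Dependency {
    inner: Rc<DependencyData>,
}

impl Dependency {
    // `Rc::make_mut` only deep-clones the inner data when the `Rc` is shared,
    // so the common "uniquely owned" case is as cheap as mutating a plain
    // struct, and no separate `DependencyInner` setter is needed.
    fn set_kind(&mut self, kind: Kind) -> &mut Dependency {
        Rc::make_mut(&mut self.inner).kind = kind;
        self
    }
}
```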
src/cargo/core/dependency.rs
Outdated
@@ -327,14 +339,14 @@ impl Dependency {
     self.inner.matches_id(id)
 }

-pub fn map_source(self, to_replace: &SourceId, replace_with: &SourceId)
+pub fn map_source(mut self, to_replace: &SourceId, replace_with: &SourceId)
Semantically we want `map_source(&self) -> Cow<Dependency>` here, but I don't think that would be an actual improvement performance- or readability-wise.
Nah, moving an owned `Rc` around by value is super cheap; I don't think I've ever seen it in a profile anywhere. This should arguably just take `&mut self` though.
src/cargo/core/registry.rs
Outdated
-let real_summaries = match self.sources.get_mut(dep.source_id()) {
-    Some(src) => Some(src.query(&dep)?),
-    None => None,
+let mut sources = self.sources.borrow_mut();
So this is the only place where we need `RefCell`, and it is because we use `self.lock` down the line. `self.lock` only looks at the `self.locked` field, so perhaps we can extract `fn lock_helper(locked: &HashMap<too many angle brackets>, summary: Summary) -> Summary` and call it directly in the closure? That way we should please the borrow checker without `RefCell`ing.
`RefCell` was a bit surprising to see here: I expected to find some actual mutation through a shared reference, so perhaps just adding a comment on why we need `RefCell` would help just as well as refactoring it away.
Yeah, I originally thought of doing that, but the call to `warn_bad_override` caught me by surprise. The mutable borrow on `self.sources` means we otherwise couldn't use `self`, so this seemed like the "cleanest" way of implementing this.
Definitely warrants a comment though!
> the call to `warn_bad_override` caught me by surprise.
Hm, shouldn't the borrow of `sources` be dead by the time we call `.warn_bad_override`? If that's the case, then adding a block `{}` could help? Something like this:
let (override_summary, n, to_warn) = {
    // Look for an override and get ready to query the real source.
    //
    // Note that we're not using `&mut self` to load `self.sources` here but
    // rather we're going through `RefCell::borrow_mut`. This is currently
    // done to have access to both the `lock` function and the
    // `warn_bad_override` functions we call below.
    let override_summary = self.query_overrides(&dep)?;
    let mut sources = self.sources.get_mut();
    let source = match sources.get_mut(dep.source_id()) {
        Some(src) => src,
        None => {
            if override_summary.is_some() {
                bail!("override found but no real ones");
            }
            return Ok(())
        }
    };
    match override_summary {
        // If we have an override summary then we query the source to sanity
        // check its results. We don't actually use any of the summaries it
        // gives us though.
        Some(s) => {
            let mut n = 0;
            let mut to_warn = None;
            source.query(dep, &mut |summary| {
                n += 1;
                to_warn = Some(summary);
            })?;
            (s, n, to_warn)
        }
        // If we don't have an override then we just ship everything
        // upstairs after locking the summary
        None => {
            return source.query(dep, &mut |summary| (unimplemented!()))
        }
    }
};
if n > 1 {
    bail!("found an override with a non-locked list");
} else if let Some(summary) = to_warn {
    self.warn_bad_override(&override_summary, &summary)?;
}
f(self.lock(override_summary));
Ok(())
This variant gets pretty ugly though, so perhaps we can refactor the logic here to handle the easy cases up front?
let (override_summary, n, to_warn) = {
    let overriden = self.query_overrides(&dep)?;
    let real = self.sources.get_mut().get_mut(dep.source_id());
    match (overriden, real) {
        (Some(_), None) => bail!("override found but no real ones"),
        (None, None) => return Ok(()),
        (None, Some(source)) => return source.query(dep, &mut |summary| (unimplemented!())),
        (Some(override_summary), Some(source)) => {
            let mut n = 0;
            let mut to_warn = None;
            source.query(dep, &mut |summary| {
                n += 1;
                to_warn = Some(summary);
            })?;
            (override_summary, n, to_warn)
        }
    }
};
if n > 1 {
    bail!("found an override with a non-locked list");
} else if let Some(summary) = to_warn {
    self.warn_bad_override(&override_summary, &summary)?;
}
f(self.lock(override_summary));
Ok(())
@alexcrichton pinging about the previous comment :)
Oops sorry! Excellent suggestions and they do indeed work!
src/cargo/core/registry.rs
Outdated
// After all that's said and done we need to check the override summary
// itself. If present we'll do some more validation and yield it up,
// otherwise we're done.
let override_summary = match override_summary {
It's a bit hard to match this `match` with the `if` inside the lambda inside the query. Perhaps something like
match override_summary {
    None => { source.query(dep, &mut |summary| f(self.lock(summary)))?; Ok(()) }
    Some(override_summary) => {
        let mut n = 0;
        let mut to_warn = None;
        source.query(dep, &mut |summary| { n += 1; to_warn = Some(summary) })
    }
}
would read better? That is, pull `query` inside both arms of a `match`. Perhaps in the second arm we could even use `query_vec`, because we are going to print a warning anyway?
👍
src/cargo/core/resolver/mod.rs
Outdated
@@ -71,7 +72,7 @@ mod encode;
 ///
 /// Each instance of `Resolve` also understands the full set of features used
 /// for each package.
-#[derive(PartialEq, Eq, Clone)]
+#[derive(Clone)]
Let's drop `derive(Clone)`? We don't actually clone `Resolve`, and we probably shouldn't, because it is a giant pile of `HashMap`s.
src/cargo/core/resolver/mod.rs
Outdated
@@ -256,12 +257,43 @@ impl<'a> Iterator for DepsNotReplaced<'a> {
    }
}

enum RcList<T> {
    Data(Rc<(T, RcList<T>)>),
Looks like this could stack overflow on drop.
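One way to avoid that (a sketch, not necessarily what the PR ended up doing): give the list an `Option` head and make `Drop` walk the tail iteratively, stopping at the first node that is still shared, since dropping a shared `Rc` only decrements a refcount:

```rust
use std::rc::Rc;

struct RcList<T> {
    head: Option<Rc<(T, RcList<T>)>>,
}

impl<T> Drop for RcList<T> {
    fn drop(&mut self) {
        let mut cur = self.head.take();
        while let Some(node) = cur {
            match Rc::try_unwrap(node) {
                // We were the last owner: detach the tail and keep walking
                // instead of letting it drop recursively.
                Ok((_data, mut next)) => cur = next.head.take(),
                // Still shared elsewhere: dropping this `Rc` is O(1), stop.
                Err(_) => break,
            }
        }
    }
}
```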
src/cargo/core/resolver/mod.rs
Outdated
 #[derive(Clone)]
 struct Context<'a> {
-    activations: HashMap<(String, SourceId), Vec<Rc<Summary>>>,
-    resolve_graph: Graph<PackageId>,
+    activations: HashMap<String, HashMap<SourceId, Vec<Summary>>>,
Looks like there is almost always a single activation for a package, so we could use a `SmallMap` here, but it looks like there isn't one on crates.io? It should be pretty straightforward to roll one on top of smallvec.
Yeah, this is something I need to add a TODO for. The fields of `activations` and the binary heap of remaining deps are still cloned (if needed) on all activations. These are the number one and two costs in resolution; they just disappeared in this PR because of the fast path which avoids cloning.

I thought for a while about how to work with this, but yeah, I think the nested map being some sort of "small" thing makes sense. I was hoping we could hold out and wait for a persistent hash map, though, to avoid too many crazy data structures.

Once I saw the "don't clone as much" optimization erase this from the profile, I figured it'd be best to move on to the rest of Cargo :). Right now loading info from the index and parsing TOML is pretty high up there. Those I think would ideally be tackled next.
👍 (at least, the branch is named hamt :) )

Just to clarify, I wanted to use `SmallMap` just for the inner map here. Hm, and I now actually think that the innermost `Vec` is what should be `Small`. If it is correct that the number of activations is almost always one, we can try to reduce the number of allocations here by storing the first activation on the stack, and this is somewhat orthogonal to making the whole thing a persistent data structure.
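For illustration, the innermost container could then be something like this (assumes the `smallvec` crate; `Summary` here is a stand-in type):

```rust
use smallvec::SmallVec;

#[derive(Clone, Debug, PartialEq)]
struct Summary(String);

fn main() {
    // One inline slot: the common single-activation case never touches the heap.
    let mut activations: SmallVec<[Summary; 1]> = SmallVec::new();

    activations.push(Summary("foo 1.0.0".to_string()));
    assert!(!activations.spilled());

    // A second activation of the same package spills to a heap allocation.
    activations.push(Summary("foo 2.0.0".to_string()));
    assert!(activations.spilled());
}
```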
☔ The latest upstream changes (presumably #4110) made this pull request unmergeable. Please resolve the merge conflicts.
Updated!
 }

 let activated = cx.flag_activated(&candidate.summary, method);

 let candidate = match candidate.replace {
     Some(replace) => {
-        cx.resolve_replacements.insert(candidate.summary.package_id().clone(),
-                                       replace.package_id().clone());
+        cx.resolve_replacements.push((candidate.summary.package_id().clone(),
The new `resolve_replacements` has slightly different behavior regarding duplicates. When we `.iter().filter()` it, we'll now get all the duplicates instead of only one. Does this matter? If no duplicates are possible, let's add a comment to `resolve_replacements: RcList<(PackageId, PackageId)>` explaining that it is effectively a map and the first elements must be unique.
Done!
let mut set = self.resolve_features.entry(pkgid.clone())
    .or_insert_with(HashSet::new);
for feature in used_features {
    if !set.contains(feature) {
Can micro-optimize to `set.entry(feature).or_insert_with(|| feature.to_string())`.
Unfortunately hash sets don't have the `entry` API :(
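For context, this is why the `contains`-then-`insert` pair stays (sketch only): `HashSet::insert` needs an owned `String` up front, so calling it unconditionally would allocate even when the feature is already present.

```rust
use std::collections::HashSet;

fn add_feature(set: &mut HashSet<String>, feature: &str) {
    // Two lookups in the worst case, but the `String` is only allocated for
    // features we haven't seen before.
    if !set.contains(feature) {
        set.insert(feature.to_string());
    }
}
```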
src/cargo/core/resolver/mod.rs
Outdated
 }

 fn check_cycles(resolve: &Resolve,
-                activations: &HashMap<(String, SourceId), Vec<Rc<Summary>>>)
+                activations: &HashMap<String, HashMap<SourceId, Vec<Summary>>>)
We could type-alias `activations` if we intend to tweak this data structure further.
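Something like the following is what's being suggested (a sketch; `SourceId` and `Summary` are stand-ins for the real types):

```rust
use std::collections::HashMap;

#[derive(PartialEq, Eq, Hash)]
struct SourceId(u32);
struct Summary;

// One name to update if the shape of the activations map changes again.
type Activations = HashMap<String, HashMap<SourceId, Vec<Summary>>>;

fn check_cycles(activations: &Activations) {
    // ... the signature now reads the same everywhere the map is threaded through
    let _ = activations;
}
```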
@@ -204,6 +181,12 @@ impl<'cfg> RegistryIndex<'cfg> {
         _ => true,
     }
 });
-summaries.query(dep)
+
+for summary in summaries {
We can now remove `impl Registry for Vec<Summary>`, or implement it for an iterator. I prefer dropping it because this is an implementation detail, but providing an `impl` tricks the reader into believing that `Vec<Summary>` is a real `Registry` which is used somewhere.
Done!
Implementing it for all iterators is sort of tricky, so I figured `dep.matches(s)` isn't so hard for now.
fn load(&self,
        _root: &Path,
        path: &Path,
        data: &mut FnMut(&[u8]) -> CargoResult<()>) -> CargoResult<()>;
This could be made generic to return `CargoResult<T>`. Are we intentionally avoiding monomorphization here?
Ah yeah, this needs to be object safe, so we've got to avoid the generics.
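To spell out the object-safety point (a sketch with simplified names and a stand-in error type, not Cargo's real trait):

```rust
use std::path::Path;

type CargoResult<T> = Result<T, String>; // stand-in for cargo's error type

trait RegistryData {
    // Object-safe: no type parameters on the method; callers hand in a closure.
    fn load(&self,
            root: &Path,
            path: &Path,
            data: &mut dyn FnMut(&[u8]) -> CargoResult<()>) -> CargoResult<()>;

    // A generic `fn load<T>(...) -> CargoResult<T>` would avoid the fixed
    // return type, but a method with its own type parameter makes the trait
    // unusable as `&dyn RegistryData` / `Box<dyn RegistryData>`.
}

fn use_it(registry: &dyn RegistryData, root: &Path, path: &Path) -> CargoResult<()> {
    registry.load(root, path, &mut |bytes| {
        println!("loaded {} bytes", bytes.len());
        Ok(())
    })
}
```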
You can see how I started down this rabbit hole!
Updated!
@bors r+
📌 Commit f0b6a87 has been approved by
⌛ Testing commit f0b6a87 with merge a852f5f...
💔 Test failed - status-travis
This commit is a relatively serious optimization pass of the resolution phase in Cargo, targeted at removing as many allocations as possible from this phase. Executed as an iterative loop, this phase of Cargo can often be costly for large graphs, but it's run on every single build! The main optimization here is to avoid cloning the context and/or pushing a backtracking frame if there are no candidates left in the current list of candidates. That optimizes a fast path for crates with lock files (almost all of them) and gets us to the point where cloning the context basically disappears from all profiling.
This already happens in a few other places in Cargo (e.g. `Dependency`) and `Summary` cloning turned up high in the profile, so let's make it cheaper.
Follow the same pattern with `Summary`
There's a few keys we don't need owned versions of, so try using Serde's zero-copy deserialization where we can.
Previously all intermediate stages would create and return `Vec<Summary>`, but this is a pretty costly operation once you start layering. Ideally we'd use an iterator-based approach here but working with that in trait objects is difficult, so this commit takes a closure-based approach to avoid all the intermediate allocations that are thrown away.
Avoid some unnecessary clones, `to_vec`, etc. Not super optimal, but it improves the profile here regardless.
Avoid unnecessary `String` allocations in hot paths that get run a lot for large graphs.
Remove some allocations around the stored hashes by having nested hash maps instead of tuple keys. Also remove an intermediate array when parsing dependencies through a custom implementation of `Deserialize`. While this doesn't make this code path blazingly fast, it definitely knocks it down in the profiles below other higher-value targets.
On "fresh" builds this ends up just wasting a lot of time!
Relatively expensive to calculate, never changes, easy to add a cache!
No need for it to be exposed any more, let's just use `Rc::make_mut` judiciously.
Fewer branches and more intuitive flow.
It's super expensive to clone a `Resolve` and the resolution implementation no longer needs to do so, let's remove the impl.
Turn recursion into a loop
They're basically barely used now anyway.
Avoids duplicating tons of maps!
Some choice refactoring makes it no longer necessary!
Failure looks legitimate.
@bors: r=matklad
📌 Commit a389d48 has been approved by
Optimize a slew of Cargo internals

Cargo has historically had very little optimization applied to it. Despite that, it's pretty speedy today, but there's always a desire to be faster! I've noticed Cargo being particularly sluggish on projects like Servo and rust-lang/rust, so I started profiling and found quite a few low-hanging fruit!

This PR is a slew of optimizations across Cargo for various things found here and there. The banner optimizations are:

* Resolution with a lock file should be basically a noop in terms of execution time now. An optimization was done to avoid cloning `Context` unless necessary, and that basically means it doesn't get cloned now! As the number 1 source of slowdown in Cargo this is the biggest improvement.
* Lots of pieces in `resolve` are now `Rc<T>` for being more easily cloneable.
* `Summary` now internally contains an `Rc` like `Dependency`, making it much more quickly cloneable.
* `Registry` as a trait no longer returns a `Vec` but rather takes a closure to yield summaries up, removing lots of intermediate arrays.
* We no longer spawn a thread for all units of "fresh work", only when we're about to spawn a process.

Almost everything here was guided through profiling `./x.py build` on rust-lang/rust or `cargo build -p log` on Servo. Both of these stress "noop resolution" and the former also stresses noop builds. Runs of `./x.py build` dropped from 4 to 2 seconds (with lots of low-hanging fruit still remaining in Cargo itself) and `cargo build -p log` dropped from 1.5s to 0.3s. Massif graphs of Cargo's memory usage also show that the peak memory usage of Cargo in a noop build of Servo dropped from 300MB to 30MB during resolution.

I'm hoping that none of these optimizations makes the code less readable and/or understandable. There are no algorithmic improvements in this PR other than those transitively picked up by making clones cheaper and/or allocating less.
☀️ Test successful - status-appveyor, status-travis