
Optimize a slew of Cargo internals #4118

Merged
merged 19 commits into from
Jun 5, 2017
Conversation

alexcrichton
Member

Cargo has historically had very little optimization applied to it. Despite that, it's pretty speedy today, but there's always a desire to be faster! I've noticed Cargo being particularly sluggish on projects like Servo and rust-lang/rust, so I started profiling and found quite a bit of low-hanging fruit!

This PR is a slew of optimizations across Cargo for various things found here and there. The banner optimizations are:

  • Resolution with a lock file should be basically a noop in terms of execution time now. An optimization was done to avoid cloning Context unless necessary, and that basically means it doesn't get cloned now! As the number 1 source of slowdown in Cargo this is the biggest improvement.
  • Lots of pieces in resolve are now Rc<T> for being more easily cloneable.
  • Summary now internally contains an Rc like Dependency, making it much more quickly cloneable.
  • Registry as a trait no longer returns a Vec but rather takes a closure to yield summaries up, removing lots of intermediate arrays.
  • We no longer spawn a thread for all units of "fresh work", only when we're about to spawn a process.
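The closure-based `Registry` shape described above can be sketched roughly as follows. This is a minimal illustration with made-up `Summary`/`Dependency` stand-ins and a hypothetical `InMemoryRegistry`, not Cargo's actual trait or types:

```rust
// Illustrative sketch only: instead of each layer building and returning a
// `Vec<Summary>`, matches are pushed into a caller-supplied closure, so
// layered registries don't allocate intermediate vectors.
#[derive(Clone, Debug, PartialEq)]
struct Summary {
    name: String,
}

struct Dependency {
    name: String,
}

trait Registry {
    // Yield each matching summary to `f` rather than collecting a Vec.
    fn query(&mut self, dep: &Dependency, f: &mut dyn FnMut(Summary)) -> Result<(), String>;
}

struct InMemoryRegistry {
    summaries: Vec<Summary>,
}

impl Registry for InMemoryRegistry {
    fn query(&mut self, dep: &Dependency, f: &mut dyn FnMut(Summary)) -> Result<(), String> {
        for s in self.summaries.iter().filter(|s| s.name == dep.name) {
            f(s.clone()); // handed straight to the caller; no intermediate array
        }
        Ok(())
    }
}

fn main() {
    let mut reg = InMemoryRegistry {
        summaries: vec![
            Summary { name: "log".to_string() },
            Summary { name: "serde".to_string() },
        ],
    };
    let mut found = Vec::new();
    reg.query(&Dependency { name: "log".to_string() }, &mut |s| found.push(s)).unwrap();
    assert_eq!(found.len(), 1);
}
```

The `&mut dyn FnMut` parameter keeps the trait object safe, which matters later in this review when generics come up.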

Almost everything here was guided through profiling ./x.py build on rust-lang/rust or cargo build -p log on Servo. Both of these stress "noop resolution" and the former also stresses noop builds.

Runs of ./x.py build dropped from 4 to 2 seconds (with lots of low-hanging fruit still remaining in Cargo itself) and cargo build -p log dropped from 1.5s to 0.3s. Massif graphs showing Cargo's memory usage also show that the peak memory usage of Cargo in a noop build of Servo dropped from 300MB to 30MB during resolution.

I'm hoping that none of these optimizations makes the code less readable and/or understandable. There are no algorithmic improvements in this PR other than those transitively picked up by making clones cheaper and/or allocating less.

@rust-highfive

r? @brson

(rust_highfive has picked a reviewer for you, use r? to override)

@alexcrichton
Member Author

r? @matklad

@rust-highfive rust-highfive assigned matklad and unassigned brson Jun 2, 2017
Member

@Mark-Simulacrum Mark-Simulacrum left a comment

Quick overview; feel free to ignore most of what I said here. Just some thoughts.

enum RcList<T> {
    Data(Rc<(T, RcList<T>)>),
    Empty,
}
Member

I believe this is equivalent to LinkedList<Rc<T>>, right? Or is there some advantage to this?

Member Author

Ah, not quite: the intention here is that Clone for RcList<T> is O(1), whereas Clone for LinkedList<T> is O(n).

Member

I guess Rc<LinkedList<T>> would get us what we wanted. Though that's kind of questionable at best...
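The O(1)-clone property under discussion can be sketched like this. The `push` and `head` helpers are hypothetical additions for illustration; only the enum shape comes from the diff:

```rust
use std::rc::Rc;

// Sketch of the `RcList<T>` idea: a persistent cons list whose `Clone`
// only bumps one reference count, unlike `LinkedList<T>`, whose `Clone`
// copies every node.
enum RcList<T> {
    Data(Rc<(T, RcList<T>)>),
    Empty,
}

impl<T> Clone for RcList<T> {
    fn clone(&self) -> RcList<T> {
        // O(1): either copy the `Empty` tag or bump a single refcount.
        match *self {
            RcList::Data(ref d) => RcList::Data(d.clone()),
            RcList::Empty => RcList::Empty,
        }
    }
}

impl<T> RcList<T> {
    // Hypothetical helper: prepend without touching existing nodes.
    fn push(&self, item: T) -> RcList<T> {
        RcList::Data(Rc::new((item, self.clone())))
    }

    // Hypothetical helper: peek at the most recently pushed item.
    fn head(&self) -> Option<&T> {
        match *self {
            RcList::Data(ref d) => Some(&d.0),
            RcList::Empty => None,
        }
    }
}

fn main() {
    let list = RcList::Empty.push(1).push(2);
    let copy = list.clone(); // shares all nodes with `list`
    assert_eq!(copy.head(), Some(&2));
}
```

An `Rc<LinkedList<T>>` would also clone in O(1), but prepending would then require cloning the whole list first, which is exactly what this structure avoids.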

@@ -369,21 +403,32 @@ impl<T> RcVecIter<T> {
}
}

impl<T> Clone for RcVecIter<T> {
Member

Maybe add a comment here that we're not deriving this because we want to avoid T: Clone?

@@ -369,21 +403,32 @@ impl<T> RcVecIter<T> {
}
Member

If we're refactoring anyway, the only thing this is used for is to get the equivalent of first() on the underlying vec. Maybe just use that in the first place?

impl<T> Iterator for RcVecIter<T> where T: Clone {
type Item = (usize, T);

Member

As far as I can tell, the only reason we return (usize, T) here is because activate_deps_loop uses it for debugging (trace-level). Is it worth keeping? It's arguably equivalent to just storing Enumerate<RcVecIter<T>>.

Member Author

Ah, unfortunately Enumerate doesn't allow access to its contents, though :(

Member Author

Also, historically that cur has been critical for debugging problems in resolve, so I'd want to keep it if we could!
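A minimal sketch of the `RcVecIter<T>` pattern being discussed, showing both the manual `Clone` (to avoid the `T: Clone` bound a derive would add) and the `(usize, T)` items kept for debugging. Field names are illustrative, not necessarily Cargo's exact ones:

```rust
use std::rc::Rc;

// Sketch: an iterator that indexes into a shared `Rc<Vec<T>>`.
struct RcVecIter<T> {
    vec: Rc<Vec<T>>,
    cur: usize,
}

// Not derived: `#[derive(Clone)]` would require `T: Clone`, but cloning
// the iterator only needs to clone the `Rc` handle and the index.
impl<T> Clone for RcVecIter<T> {
    fn clone(&self) -> RcVecIter<T> {
        RcVecIter { vec: self.vec.clone(), cur: self.cur }
    }
}

impl<T: Clone> Iterator for RcVecIter<T> {
    // Yield the index alongside the item; `cur` has historically been
    // useful when debugging resolution.
    type Item = (usize, T);

    fn next(&mut self) -> Option<(usize, T)> {
        let item = self.vec.get(self.cur).cloned()?;
        let idx = self.cur;
        self.cur += 1;
        Some((idx, item))
    }
}

fn main() {
    let iter = RcVecIter { vec: Rc::new(vec!["a", "b"]), cur: 0 };
    let items: Vec<_> = iter.collect();
    assert_eq!(items, vec![(0, "a"), (1, "b")]);
}
```

Unlike `Enumerate<RcVecIter<T>>`, this shape keeps `cur` directly accessible on the struct for tracing.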

prev_active.iter().any(|a| *a == b.summary) ||
prev_active.iter().all(|a| {
!compatible(a.version(), b.summary.version())
})
Member

So this is effectively worst-case looping through prev_active twice; we could avoid that with a relatively simple for loop. This seems like a clear win to me.

Member Author

Ah yeah, this loop actually doesn't matter too much; 100% of the slowdown so far has stemmed from too much allocation. Nothing so far has been "you should actually run less code"!
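The single-pass version suggested above could look like this. Types are simplified stand-ins (versions as `(major, minor)` pairs, "compatible" stubbed as "same major"), not Cargo's real `Summary`/`Version`; the boolean logic mirrors the original `any(..) || all(..)` expression, including its behavior on an empty slice:

```rust
// Sketch: fold the two passes over `prev_active` into one loop.
fn conflicts(prev_active: &[(u32, u32)], candidate: (u32, u32)) -> bool {
    let mut all_incompatible = true;
    for &active in prev_active {
        // The `any` arm: this exact summary is already activated.
        if active == candidate {
            return true;
        }
        // Tracks the `all(!compatible)` arm.
        if active.0 == candidate.0 {
            all_incompatible = false;
        }
    }
    // `all` over an empty iterator is `true`, matching the original.
    all_incompatible
}

fn main() {
    assert!(conflicts(&[(1, 0), (1, 2)], (1, 2))); // duplicate activation
    assert!(conflicts(&[(2, 0)], (1, 0)));         // nothing compatible
    assert!(!conflicts(&[(1, 0)], (1, 2)));        // compatible, not a duplicate
}
```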

self
}

pub fn set_kind(&mut self, kind: Kind) -> &mut Dependency {
Member

@matklad matklad Jun 2, 2017

I don't really like that these methods are duplicated between Dependency and DependencyInner. It looks like DependencyInner plays the dual role of a dependency builder (it exposes more setters than Dependency) and an Rc-wrapped data store. Also, there are a bunch of pub getters on DependencyInner which could be private:

    fn version_req(&self) -> &VersionReq { &self.req }
    fn name(&self) -> &str { &self.name }
    fn source_id(&self) -> &SourceId { &self.source_id }
    fn kind(&self) -> Kind { self.kind }
    fn specified_req(&self) -> bool { self.specified_req }
    fn platform(&self) -> Option<&Platform> {
        self.platform.as_ref()
    }

Ouch, and lock_to on DependencyInner can be private as well!

Could we somehow improve the Dependency/DependencyInner separation? Perhaps we could make a private DependencyData without any getters as the thing inside the Rc inside Dependency, and a public DependencyBuilder with all public fields and no getters, so that one can use the sweet { .. default } syntax to construct deps without setter boilerplate.

Conflating refactorings and perf optimizations is not necessarily a good idea, but perhaps we can make things nicer in a simple way by just moving all those "can be private" things from DependencyInner directly into Dependency, just to offset the duplication in set_version and set_kind?

Member Author

Since the discovery of Rc::make_mut I actually think that DependencyInner is just no longer needed entirely. The perf should be the same if you just use &mut self methods on Dependency as they'd all hit the "I'm an owned Rc" fast path anyway.

I don't mind doing that as part of this!
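The `Rc::make_mut` pattern mentioned above can be sketched as follows. The field names and the `u8` kind are illustrative stand-ins, not Cargo's actual `Dependency` definition:

```rust
use std::rc::Rc;

// Sketch: `&mut self` setters go through `Rc::make_mut`, which only
// deep-clones the inner data when the `Rc` is shared; a uniquely owned
// `Rc` takes the fast path with no clone at all.
#[derive(Clone)]
struct Inner {
    kind: u8,
}

#[derive(Clone)]
struct Dependency {
    inner: Rc<Inner>,
}

impl Dependency {
    fn set_kind(&mut self, kind: u8) -> &mut Dependency {
        // Clones `Inner` only if another `Rc` handle exists.
        Rc::make_mut(&mut self.inner).kind = kind;
        self
    }
}

fn main() {
    let mut dep = Dependency { inner: Rc::new(Inner { kind: 0 }) };
    let shared = dep.clone();
    dep.set_kind(1); // copy-on-write: `shared` forces a clone of `Inner`
    assert_eq!(dep.inner.kind, 1);
    assert_eq!(shared.inner.kind, 0); // the shared handle is untouched
}
```

This is why a separate `DependencyInner` builder type stops paying for itself: the owned-`Rc` fast path makes `&mut self` setters on `Dependency` just as cheap.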

@@ -327,14 +339,14 @@ impl Dependency {
self.inner.matches_id(id)
}

pub fn map_source(self, to_replace: &SourceId, replace_with: &SourceId)
pub fn map_source(mut self, to_replace: &SourceId, replace_with: &SourceId)
Member

Semantically we want map_source(&self) -> Cow<Dependency> here, but I don't think that would be an actual improvement performance or readability wise.

Member Author

Nah, moving an owned Rc around by value is super cheap; I don't think I've ever seen it in a profile anywhere. This should arguably just take &mut self, though.

let real_summaries = match self.sources.get_mut(dep.source_id()) {
Some(src) => Some(src.query(&dep)?),
None => None,
let mut sources = self.sources.borrow_mut();
Member

So this is the only place where we need RefCell, and it's because we use self.lock down the line. self.lock only looks at the self.locked field, so perhaps we can extract fn lock_helper(locked: &HashMap<too many angle brackets>, summary: Summary) -> Summary and call it directly in the closure? That way we should please the borrow checker without RefCell-ing.

Member

RefCell was a bit surprising to see here: I expected to find some actual mutation through a shared reference, so perhaps just adding a comment explaining why we need the RefCell would help just as well as refactoring it away.

Member Author

Yeah, I originally thought of doing that, but the call to warn_bad_override caught me by surprise. The mutable borrow of self.sources means we otherwise couldn't use self, so this seemed like the "cleanest" way of implementing this.

Definitely warrants a comment though!

Member

the call to warn_bad_override caught me by surprise.

Hm, shouldn't the borrow of sources be dead by the time we call .warn_bad_override? If it is the case, then adding a block {} could help? Something like this:

        let (override_summary, n, to_warn) = {
            // Look for an override and get ready to query the real source.
            //
            // Note that we're not using `&mut self` to load `self.sources` here but
            // rather we're going through `RefCell::borrow_mut`. This is currently
            // done to have access to both the `lock` function and the
            // `warn_bad_override` functions we call below.
            let override_summary = self.query_overrides(&dep)?;
            let mut sources = self.sources.get_mut();
            let source = match sources.get_mut(dep.source_id()) {
                Some(src) => src,
                None => {
                    if override_summary.is_some() {
                        bail!("override found but no real ones");
                    }
                    return Ok(())
                }
            };

            match override_summary {
                // If we have an override summary then we query the source to sanity
                // check its results. We don't actually use any of the summaries it
                // gives us though.
                Some(s) => {
                    let mut n = 0;
                    let mut to_warn = None;
                    source.query(dep, &mut |summary| {
                        n += 1;
                        to_warn = Some(summary);
                    })?;
                    (s, n, to_warn)
                }

                // If we don't have an override then we just ship everything
                // upstairs after locking the summary
                None => {
                    return source.query(dep, &mut |summary| (unimplemented!()))
                }
            }
        };

        if n > 1 {
            bail!("found an override with a non-locked list");
        } else if let Some(summary) = to_warn {
            self.warn_bad_override(&override_summary, &summary)?;
        }
        f(self.lock(override_summary));
        Ok(())

This variant gets pretty ugly though, so perhaps we can refactor logic here to handle the easy cases up front?

        let (override_summary, n, to_warn) = {
            let overridden = self.query_overrides(&dep)?;
            let real = self.sources.get_mut().get_mut(dep.source_id());
            match (overridden, real) {
                (Some(_), None) => bail!("override found but no real ones"),
                (None, None) => return Ok(()),
                (None, Some(source)) => return source.query(dep, &mut |summary| (unimplemented!())),
                (Some(override_summary), Some(source)) => {
                    let mut n = 0;
                    let mut to_warn = None;
                    source.query(dep, &mut |summary| {
                        n += 1;
                        to_warn = Some(summary);
                    })?;
                    (override_summary, n,  to_warn)
                }
            }
        };

        if n > 1 {
            bail!("found an override with a non-locked list");
        } else if let Some(summary) = to_warn {
            self.warn_bad_override(&override_summary, &summary)?;
        }
        f(self.lock(override_summary));
        Ok(())

Member

@alexcrichton pinging about the previous comment :)

Member Author

Oops sorry! Excellent suggestions and they do indeed work!

// After all that's said and done we need to check the override summary
// itself. If present we'll do some more validation and yield it up,
// otherwise we're done.
let override_summary = match override_summary {
Member

@matklad matklad Jun 2, 2017

It's a bit hard to match this match with the if inside the lambda inside the query. Perhaps something like

match override_summary {
    None => { source.query(dep, &mut |summary| f(self.lock(summary)))?; Ok(()) }
    Some(override_summary) => {
        let mut n = 0;
        let mut to_warn = None;
        source.query(dep, &mut |summary| { n += 1; to_warn = Some(summary) })
    }
}

would read better? That is, pull query inside both arms of a match. Perhaps in the second arm we could even use query_vec, because we are going to print a warning anyway?

Member Author

👍

@@ -71,7 +72,7 @@ mod encode;
///
/// Each instance of `Resolve` also understands the full set of features used
/// for each package.
#[derive(PartialEq, Eq, Clone)]
#[derive(Clone)]
Member

Let's drop derive(Clone)? We don't actually clone Resolve, and we probably shouldn't because it is a giant pile of HashMaps.

@@ -256,12 +257,43 @@ impl<'a> Iterator for DepsNotReplaced<'a> {
}
}

enum RcList<T> {
Data(Rc<(T, RcList<T>)>),
Member

Looks like this could stack overflow on drop.
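One way to address the recursive-drop concern is to restructure the list around `Option<Rc<...>>` so a manual `Drop` can walk the chain iteratively with `take()`. This is an illustrative sketch of that technique, not necessarily the shape the PR ended up with:

```rust
use std::rc::Rc;

// Sketch: same persistent-list idea, but with a struct wrapper so `Drop`
// can tear the chain down in a loop instead of by recursion.
struct RcList<T> {
    head: Option<Rc<(T, RcList<T>)>>,
}

impl<T> Clone for RcList<T> {
    fn clone(&self) -> RcList<T> {
        // Still O(1): bumps one refcount.
        RcList { head: self.head.clone() }
    }
}

impl<T> Drop for RcList<T> {
    fn drop(&mut self) {
        let mut cur = self.head.take();
        while let Some(rc) = cur {
            match Rc::try_unwrap(rc) {
                // Uniquely owned: detach the tail and keep looping rather
                // than letting it drop recursively.
                Ok((_data, mut next)) => cur = next.head.take(),
                // Shared: another owner is responsible for the rest.
                Err(_) => break,
            }
        }
    }
}

fn main() {
    // A naive recursive drop would overflow the stack on a list this long.
    let mut list = RcList { head: None };
    for i in 0..200_000 {
        list = RcList { head: Some(Rc::new((i, list))) };
    }
    drop(list);
}
```

The `take()` is what breaks the recursion: each detached tail is left empty, so its own `Drop` does nothing.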

#[derive(Clone)]
struct Context<'a> {
activations: HashMap<(String, SourceId), Vec<Rc<Summary>>>,
resolve_graph: Graph<PackageId>,
activations: HashMap<String, HashMap<SourceId, Vec<Summary>>>,
Member

Looks like there is almost always a single activation for a package, so we could use a SmallMap here, but it looks like there isn't one on crates.io? Should be pretty straightforward to roll one on top of smallvec.

Member Author

Yeah, this is something I need to add a TODO for. The activations field and the binary heap of remaining deps are still cloned (when needed) on every activation. These are the number one and two costs in resolution; they just disappeared in this PR because of the fast path which avoids cloning.

I thought for a while about how to work with this, and yeah, I think making the nested map some sort of "small" thing makes sense. I was hoping we could hold out and wait for a persistent hash map, though, to avoid too many crazy data structures.

Once I saw the "don't clone as much" optimization erase this from the profile, though, I figured it'd be best to move on to the rest of Cargo :). Right now loading info from the index and parsing TOML are pretty high up in the profile; those would ideally be tackled next.

Member

👍 (at least, the branch is named hamt :) )

Just to clarify, I wanted to use SmallMap just for the inner map here. Hm, and I now actually think that the innermost Vec is what should be small. If it's correct that the number of activations is almost always one, we can try to reduce the number of allocations here by storing the first activation on the stack; this is somewhat orthogonal to making the whole thing a persistent data structure.
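The "store the first activation on the stack" idea can be sketched with a tiny hand-rolled small-vector. This is illustrative only; the `smallvec` crate provides a production version of the same technique:

```rust
use std::mem;

// Sketch: keep zero or one activation inline, and only touch the heap
// once a second activation for the same package/source appears.
enum SmallActivations<T> {
    None,
    One(T),
    Many(Vec<T>),
}

impl<T> SmallActivations<T> {
    fn push(&mut self, item: T) {
        match mem::replace(self, SmallActivations::None) {
            SmallActivations::None => *self = SmallActivations::One(item),
            SmallActivations::One(first) => {
                // The first heap allocation happens only here.
                *self = SmallActivations::Many(vec![first, item]);
            }
            SmallActivations::Many(mut v) => {
                v.push(item);
                *self = SmallActivations::Many(v);
            }
        }
    }

    fn len(&self) -> usize {
        match *self {
            SmallActivations::None => 0,
            SmallActivations::One(_) => 1,
            SmallActivations::Many(ref v) => v.len(),
        }
    }
}

fn main() {
    let mut acts = SmallActivations::None;
    acts.push("1.0.0"); // common case: no allocation at all
    assert_eq!(acts.len(), 1);
    acts.push("2.0.0"); // rare case: spill to a Vec
    assert_eq!(acts.len(), 2);
}
```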

@bors
Contributor

bors commented Jun 2, 2017

☔ The latest upstream changes (presumably #4110) made this pull request unmergeable. Please resolve the merge conflicts.

@alexcrichton
Member Author

Updated!

}

let activated = cx.flag_activated(&candidate.summary, method);

let candidate = match candidate.replace {
Some(replace) => {
cx.resolve_replacements.insert(candidate.summary.package_id().clone(),
replace.package_id().clone());
cx.resolve_replacements.push((candidate.summary.package_id().clone(),
Member

The new resolve_replacements has slightly different behavior regarding duplicates: when we .iter().filter() it, we'll now get all the duplicates instead of only one. Does this matter? If no duplicates are possible, let's add a comment to resolve_replacements: RcList<(PackageId, PackageId)> explaining that it is effectively a map, and that the first elements must be unique.

Member Author

Done!

let mut set = self.resolve_features.entry(pkgid.clone())
.or_insert_with(HashSet::new);
for feature in used_features {
if !set.contains(feature) {
Member

Can microoptimize to set.entry(feature).or_insert_with(|| feature.to_string()).

Member Author

Unfortunately hash sets don't have the entry API :(
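Since `HashSet` has no `entry` API, avoiding the extra allocation on the already-present path needs an explicit `contains` check first, which is what the code above does. A minimal sketch of the pattern (the helper name is made up):

```rust
use std::collections::HashSet;

// Sketch: only allocate the owned `String` when we actually insert;
// `contains` accepts `&str` via `Borrow`, so the lookup is allocation-free.
fn insert_if_absent(set: &mut HashSet<String>, feature: &str) {
    if !set.contains(feature) {
        set.insert(feature.to_string());
    }
}

fn main() {
    let mut set = HashSet::new();
    insert_if_absent(&mut set, "serde");
    insert_if_absent(&mut set, "serde"); // second call allocates nothing
    assert_eq!(set.len(), 1);
}
```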

}

fn check_cycles(resolve: &Resolve,
activations: &HashMap<(String, SourceId), Vec<Rc<Summary>>>)
activations: &HashMap<String, HashMap<SourceId, Vec<Summary>>>)
Member

We could type-alias activations if we intend to tweak this data structure further.

@@ -204,6 +181,12 @@ impl<'cfg> RegistryIndex<'cfg> {
_ => true,
}
});
summaries.query(dep)

for summary in summaries {
Member

@matklad matklad Jun 3, 2017

We can now remove impl Registry for Vec<Summary>, or implement it for an iterator. I prefer dropping it, because this is an implementation detail, and providing an impl tricks the reader into believing that Vec<Summary> is a real Registry used somewhere.

Member Author

Done!

Member Author

Implementing it for all iterators is sort of tricky, so I figured dep.matches(s) isn't so hard for now.

fn load(&self,
_root: &Path,
path: &Path,
data: &mut FnMut(&[u8]) -> CargoResult<()>) -> CargoResult<()>;
Member

This could be made generic to return CargoResult<T>. Are we intentionally avoiding monomorphization here?

Member Author

Ah yeah this needs to be object safe, so we've got to avoid the generics
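The object-safety constraint can be shown with a small sketch. The trait, type, and path below are simplified stand-ins, not Cargo's real `load` signature:

```rust
// Sketch: why the `load` signature takes a closure instead of being
// generic over a return type.
trait Loader {
    // Object-safe: no type parameters; the caller consumes bytes via a
    // closure. A generic `fn load_as<T>(&self, ...) -> Result<T, String>`
    // would make `dyn Loader` ill-formed, because every `T` needs its own
    // monomorphized copy and a vtable can hold only one entry per method.
    fn load(&self, path: &str, data: &mut dyn FnMut(&[u8])) -> Result<(), String>;
}

struct InMemory(Vec<u8>);

impl Loader for InMemory {
    fn load(&self, _path: &str, data: &mut dyn FnMut(&[u8])) -> Result<(), String> {
        data(&self.0);
        Ok(())
    }
}

fn main() {
    // Usable as a trait object precisely because `load` is object safe.
    let loader: Box<dyn Loader> = Box::new(InMemory(vec![1, 2, 3]));
    let mut total = 0usize;
    loader
        .load("index/lo/g/log", &mut |bytes| total += bytes.len())
        .unwrap();
    assert_eq!(total, 3);
}
```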

@alexcrichton
Member Author

👍 (at least, the branch is named hamt :) )

You can see how I started down this rabbit hole!

@alexcrichton
Member Author

Updated!

@matklad
Member

matklad commented Jun 5, 2017

@bors r+

@bors
Contributor

bors commented Jun 5, 2017

📌 Commit f0b6a87 has been approved by matklad

@bors
Contributor

bors commented Jun 5, 2017

⌛ Testing commit f0b6a87 with merge a852f5f...

@bors
Contributor

bors commented Jun 5, 2017

💔 Test failed - status-travis

This commit is a relatively serious optimization pass of the resolution phase in
Cargo, targeted at removing as many allocations as possible from this phase.
Executed as an iterative loop, this phase of Cargo can often be costly for large
graphs, and it's run on every single build!

The main optimization here is to avoid cloning the context and/or pushing a
backtracking frame if there are no candidates left in the current list of
candidates. That optimizes a fast-path for crates with lock files (almost all of
them) and gets us to the point where cloning the context basically disappears
from all profiling.
This already happens in a few other places in Cargo (e.g. `Dependency`) and
`Summary` cloning turned up high in the profile, so let's make it cheaper.
Follow the same pattern with `Summary`
There's a few keys we don't need owned versions of, so try using Serde's
zero-copy deserialization where we can.
Previously all intermediate stages would create and return `Vec<Summary>`, but
this is a pretty costly operation once you start layering. Ideally we'd use an
iterator-based approach here but working with that in trait objects is
difficult, so this commit takes a closure-based approach to avoid all the
intermediate allocations that are thrown away.
Avoid some unnecessary clones, `to_vec`, etc. Not super optimal, but this improves
the profile here regardless.
Avoid unnecessary `String` allocations in hot paths that get run a lot for large
graphs.
Remove some allocations around the stored hashes by having nested hash maps
instead of tuple keys. Also remove an intermediate array when parsing
dependencies through a custom implementation of `Deserialize`. While this
doesn't make this code path blazingly fast it definitely knocks it down in the
profiles below other higher-value targets.
On "fresh" builds this ends up just wasting a lot of time!
Relatively expensive to calculate, never changes, easy to add a cache!
No need for it to be exposed any more, let's just use `Rc::make_mut`
judiciously.
Less branches and more intuitive flow.
It's super expensive to clone a `Resolve` and the resolution implementation no
longer needs to do so, let's remove the impl.
Turn recursion into a loop
They're basically barely used now anyway.
Avoids duplicating tons of maps!
Some choice refactoring makes it no longer necessary!
@Mark-Simulacrum
Member

Failure looks legitimate.

error: no method named `query` found for type `std::vec::Vec<core::summary::Summary>` in the current scope
   --> src/cargo/core/registry.rs:444:32
    |
444 |                 self.summaries.query(dep, f)
    |                                ^^^^^
    |
    = help: items from traits can only be used if the trait is im

@alexcrichton
Member Author

@bors: r=matklad

@bors
Contributor

bors commented Jun 5, 2017

📌 Commit a389d48 has been approved by matklad

@bors
Contributor

bors commented Jun 5, 2017

⌛ Testing commit a389d48 with merge bbb38dc...

bors added a commit that referenced this pull request Jun 5, 2017
Optimize a slew of Cargo internals

@bors
Contributor

bors commented Jun 5, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: matklad
Pushing bbb38dc to master...

@bors bors merged commit a389d48 into rust-lang:master Jun 5, 2017
@alexcrichton alexcrichton deleted the hamt branch June 6, 2017 14:08
@ehuss ehuss added this to the 1.19.0 milestone Feb 6, 2022