
incr.comp.: Explore delayed read-edge deduplication or getting rid of it entirely. #45873

Closed
michaelwoerister opened this issue Nov 8, 2017 · 19 comments
Labels
A-incr-comp Area: Incremental compilation C-enhancement Category: An issue proposing an enhancement or a PR with one. E-mentor Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion. WG-incr-comp Working group: Incremental compilation

Comments

@michaelwoerister (Member)

At the moment the compiler will always eagerly deduplicate read-edges in CurrentDepGraph::read_index. In order to be able to do this the compiler has to allocate a HashSet for each task. My suspicion is that there is not much duplication to begin with and that there's potential for optimization here:

  • If there is not much duplication to begin with, we could skip deduplication entirely for non-anonymous nodes. There should be no semantic difference, and only a slight performance hit for the reads that actually are duplicates.
  • If deduplication seems like a good idea, doing so during serialization instead of immediately has a few potential benefits:
    • Dep-graph serialization happens while LLVM is running in parallel, so work done there often does not add to the actual compile time.
    • When compilation exits early with an error, we don't save the dep-graph at all and the work for deduplication has been wasted. If compilation already reached dep-graph serialization, the chances are high that it will succeed.
    • When we already have the whole dep-graph we can implement deduplication more efficiently because we know when we don't even have to allocate a hash set at all or, if we have to, we know which capacity will suffice.

So the first step would be to collect some data on how much read-edge duplication there even is. This is most easily done by modifying the compiler to count duplicates (in librustc/dep_graph/graph.rs) and print the number/percentage in -Zincremental-info. This modified compiler can then be used to compile a number of crates to get an idea what's going on.

If duplication is low (e.g. less than 2% of reads got filtered out), then we could just remove de-duplication and test the effect with a try-build. Otherwise, we can move deduplication to DepGraph::serialize() and measure the performance impact of that.
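For illustration, here is a minimal sketch of what such duplicate counting could look like. The type and field names (CurrentDepGraph, total_read_count, total_duplicate_read_count) are simplified stand-ins, not the actual rustc internals:

    use std::collections::HashSet;

    // Simplified stand-in for rustc's dep-node index type.
    type DepNodeIndex = u32;

    struct CurrentDepGraph {
        // Reads recorded for the task currently being executed.
        reads: Vec<DepNodeIndex>,
        read_set: HashSet<DepNodeIndex>,
        // Counters that -Zincremental-info could report as a percentage.
        total_read_count: u64,
        total_duplicate_read_count: u64,
    }

    impl CurrentDepGraph {
        fn read_index(&mut self, source: DepNodeIndex) {
            self.total_read_count += 1;
            if self.read_set.insert(source) {
                self.reads.push(source);
            } else {
                // The edge was already recorded for this task: a duplicate read.
                self.total_duplicate_read_count += 1;
            }
        }

        fn duplicate_percentage(&self) -> f64 {
            100.0 * self.total_duplicate_read_count as f64 / self.total_read_count as f64
        }
    }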

@michaelwoerister added the A-incr-comp, C-enhancement, E-mentor, and WG-incr-comp labels on Nov 8, 2017
@wesleywiser (Member)

I'm working on this.

@wesleywiser (Member)

I've collected some data from a few crates. It appears that duplicate reads happen much more frequently than originally thought. Of the crates I tested, the lowest was 19.2% duplicated reads. @michaelwoerister since this doesn't match what you were expecting, would you mind looking at my changes and verifying that I didn't do something wrong? My changes are available in my incr_duplicate_read_stats branch (diff).

Full results here

@michaelwoerister (Member, Author)

Huh, this is very interesting. Thank you so much for collecting this data. Goes to show again that one should always measure before assuming anything.

The data collection in your branch looks correct, except for one thing: for anonymous nodes we can't really delay deduplication, so counting them as well will skew the assessment of what effect the proposed optimization would have. Sorry for not mentioning this earlier. Would you mind adapting your code so that it uses a separate counter for anonymous nodes?

I'm not going to make a prediction on how this will affect the numbers :)

@wesleywiser (Member)

Reran the same tests with the updated code. The results aren't significantly different (the above "Full results here" link now points to the updated results).

@michaelwoerister (Member, Author)

Thanks, @wesleywiser!

OK, let's see, Cargo is the most resource-intensive case here with 10,260,175 reads total and 1,173,234 of those left after filtering. Each read costs 4 bytes to store, so keeping all reads around would cost about 40 MB, while deduplicating immediately only needs about 4.5 MB. That's a pretty big difference, but in absolute numbers it doesn't seem like a showstopper. OTOH, since we are storing so much more data in the reads vectors, the cost of re-allocations during vector growth might eat up any wins we get from delaying deduplication.

I see two options on how to proceed:

  • Leave things as they are for now and just merge stats collection.
  • Go ahead and implement delayed deduplication in order to find out how it performs.

I'll let you decide, @wesleywiser :)

@wesleywiser (Member)

I'm game to implement delayed deduplication and see how it performs. Do you mind if I go ahead and push up a PR for the stats collection?

Also, what's the best way to measure the performance before and after the change? Just use time?

@michaelwoerister (Member, Author)

I'm game to implement delayed deduplication and see how it performs.

Cool, I'm really curious how it will do.

Do you mind if I go ahead and push up a PR for the stats collection?

No, please go ahead!

Also, what's the best way to measure the performance before and after the change?

I usually use cargo with a custom rustup toolchain. Since cargo already prints compile times, I don't need anything else. But time is fine too.

That's for local testing. Once you have a version that you think is optimized enough, you can open a PR and @Mark-Simulacrum will trigger a perf.rust-lang.org measurement for us. That will give us a good idea of what performance will look like.

bors added a commit that referenced this issue on Nov 20, 2017:

[incremental] Collect stats about duplicated edge reads from queries (part of #45873)
@wesleywiser (Member)

@michaelwoerister I've tried delaying deduplication until serialization but I'm not seeing much of a difference in compilation time. I've set a rustup override that points to a custom toolchain in my stage2 folder and I'm compiling with CARGO_INCREMENTAL=1 cargo build. That's the right command, right?

If the issue is reallocations, would preallocating the reads vector help?

(For reference purposes, my code is in my incr_delay_dedup branch)

@michaelwoerister (Member, Author)

I've tried delaying deduplication until serialization but I'm not seeing much of a difference in compilation time.

I think the potential for improvement here is about 2% for a debug build with an empty cache. For a rebuild this might actually slow things down, since we'll be deduplicating already deduplicated vectors :/ We'd have to store a flag with each vector that indicates whether it needs deduplication.
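Purely to illustrate the idea, such a flag might look roughly like this (hypothetical names, not actual rustc code):

    // Hypothetical sketch: edge lists restored from the previous, already
    // deduplicated graph would not need deduplication again, newly recorded ones would.
    struct TaskReads {
        reads: Vec<u32>,    // stand-in for Vec<DepNodeIndex>
        needs_dedup: bool,  // false if this list is known to be duplicate-free
    }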

I've set a rustup override that points to a custom toolchain in my stage2 folder and I'm compiling with CARGO_INCREMENTAL=1 cargo build. That's the right command right?

Pretty much, but you have to be careful to measure the correct thing. A few remarks:

  • You usually don't need a stage2 compiler, and unless you want to compile something with procedural macros or plugins in it, you don't even need a full stage1 compiler. Point your custom toolchain to the stage1 folder and build it with ./x.py build --stage 1 src/libtest for a limited compiler (which can still compile crates like regex or ripgrep). That should cut down massively on your cycle time.
  • Make sure that your test compiler will have performance comparable to the compiler we'll be shipping to users. In order to do that, set debug-assertions = false and codegen-units = 1 in your config.toml.
  • We want to measure the two extreme cases:
    1. Re-compiling the local crates of a project with an empty cache (compiling the dependencies from crates.io would just add noise). In order to do this, you have to run rm -rf ./target/debug/incremental && touch ./src/lib.rs && CARGO_INCREMENTAL=1 cargo build a few times. This will delete the incr. comp. cache but not the already compiled dependencies from crates.io. So the first measurement will be longer and the subsequent ones will measure what we are interested in. If the project you are compiling is an executable, you'll have to touch ./src/main.rs instead.
    2. Re-compiling the local crates of a project with a full cache and no changes. For that you just have to run touch ./src/lib.rs && CARGO_INCREMENTAL=1 cargo build a few times. It's the same as above, just without deleting the cache in between runs.

If the issue is reallocations, would preallocating the reads vector help?

There are a few things in our implementation that can be improved:

  • Don't use a BTreeSet for deduplication. In my measurements FxHashSet has always been faster, even for tiny collection sizes.
  • Don't use a set at all when the Vec of edges to deduplicate is short:
    • If the Vec has a length smaller than or equal to 1 then there is no need to allocate a set. Just copy the data.
    • If the length is 2, just directly compare the two elements.
    • If the length is greater, use HashSet::with_capacity(vec.len()) in order to avoid re-allocations.
  • Don't copy the whole graph just for deduplication as you do here: https://github.com/wesleywiser/rust/blob/413211a05ae1e283350213e758a5798cd99b24b9/src/librustc/dep_graph/graph.rs#L434-L443. Instead, deduplicate as you write to the "serialized" version in this loop (a sketch combining these suggestions follows after this list):
    for (current_dep_node_index, edges) in current_dep_graph.edges.iter_enumerated() {
        let start = edge_list_data.len() as u32;
        // This should really just be a memcpy :/
        edge_list_data.extend(edges.iter().map(|i| SerializedDepNodeIndex::new(i.index())));
        let end = edge_list_data.len() as u32;
        debug_assert_eq!(current_dep_node_index.index(), edge_list_indices.len());
        edge_list_indices.push((start, end));
    }
    Note that we cannot know the correct total_edge_count beforehand anymore. You can just remove it. It was mostly an optimization to avoid having to grow the edge_list_data vector during the loop. It might make sense to start out with a capacity of something like nodes.len() * 10 for edge_list_data though.
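Putting these suggestions together, a minimal sketch of deduplicating while writing out the edge lists might look like the following. It uses std's HashSet in place of FxHashSet and plain integers in place of the real index newtypes, so the names and signatures are assumptions rather than the actual rustc code:

    use std::collections::HashSet;

    type DepNodeIndex = usize;
    type SerializedDepNodeIndex = u32;

    fn serialize_edges(edges_per_node: &[Vec<DepNodeIndex>])
        -> (Vec<(u32, u32)>, Vec<SerializedDepNodeIndex>)
    {
        let mut edge_list_indices = Vec::with_capacity(edges_per_node.len());
        // We no longer know the exact total edge count up front, so start with
        // a generous capacity to limit re-allocations during the loop.
        let mut edge_list_data = Vec::with_capacity(edges_per_node.len() * 10);

        for edges in edges_per_node {
            let start = edge_list_data.len() as u32;
            match edges.len() {
                // Nothing to deduplicate; copy the data directly.
                0 | 1 => {
                    edge_list_data.extend(edges.iter().map(|&i| i as SerializedDepNodeIndex));
                }
                // Two elements: a direct comparison is cheaper than a set.
                2 => {
                    edge_list_data.push(edges[0] as SerializedDepNodeIndex);
                    if edges[1] != edges[0] {
                        edge_list_data.push(edges[1] as SerializedDepNodeIndex);
                    }
                }
                // Longer lists: size the set up front to avoid re-allocations.
                n => {
                    let mut seen = HashSet::with_capacity(n);
                    for &i in edges {
                        if seen.insert(i) {
                            edge_list_data.push(i as SerializedDepNodeIndex);
                        }
                    }
                }
            }
            let end = edge_list_data.len() as u32;
            edge_list_indices.push((start, end));
        }

        (edge_list_indices, edge_list_data)
    }

As written, the sketch also preserves the original edge order rather than relying on set iteration order.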

@wesleywiser (Member)

Thanks, that's really helpful! I implemented your feedback but the results don't look very good. (The source is available in the branch I linked above)

@michaelwoerister (Member, Author)

Well, that's sad. But your implementation looks correct (except for not preserving edge order in the len() > 2 case, but that should have no impact on performance) and I can't think of any more optimizations that would potentially make a big difference here.

I hope you still found it interesting to try a few things out! I'll make sure to mention your efforts in the next impl period newsletter.

@wesleywiser (Member)

Thanks! One thing I'm left wondering is if the time saved by delaying the deduplication is getting eaten by resizing the reads vec. Perhaps collecting some data about the most common size for it and preallocating to that size would improve the performance?
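One simple way to gather that data would be to record edges.len() for every node during serialization and build a histogram of the lengths; a rough sketch with assumed, simplified types:

    use std::collections::BTreeMap;

    // Count how often each edge-list length occurs during serialization.
    fn edge_len_histogram(edges_per_node: &[Vec<u32>]) -> BTreeMap<usize, u64> {
        let mut histogram = BTreeMap::new();
        for edges in edges_per_node {
            *histogram.entry(edges.len()).or_insert(0u64) += 1;
        }
        histogram
    }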

@michaelwoerister (Member, Author)

If you are up for that you can certainly try.

@wesleywiser (Member)

Ok! I'll try to collect some data and report back.

@wesleywiser (Member)

Here's what I've found:

[Four graphs: line plots of the edges.len() values observed during serialization]

@arielb1 (Contributor) commented Dec 3, 2017

@wesleywiser

This graph looks like it's missing some axes. Could you upload the raw data?

@wesleywiser (Member)

@arielb1 Sure! The graphs are line plots of the values observed for edges.len() during serialization. The stacking in the first image indicates a huge number of duplicate values.

The raw data is available here. There's one line for each iteration of this loop.

@wesleywiser (Member)

@michaelwoerister Given the huge number of empty vectors, I think preallocating would probably cause a significant memory blowup. Unless you have any ideas, I think we can probably close this issue.

@michaelwoerister (Member, Author)

@wesleywiser Yeah. Also, pre-allocating isn't free either.

Thanks again for all your work on this!
