incr.comp.: Explore delayed read-edge deduplication or getting rid of it entirely. #45873
I'm working on this.
I've collected some data from a few crates. It appears that duplicate reads happen much more frequently than originally thought. Of the crates I tested, the lowest was 19.2% duplicated reads. @michaelwoerister since this doesn't match what you were expecting, would you mind looking at my changes and verifying that I didn't do something wrong? My changes are available in my incr_duplicate_read_stats branch (diff).
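The duplicate-read counting described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual rustc code: node IDs are plain integers here, and `TaskReads`, `record_read`, and `duplicate_ratio` are hypothetical names.

```rust
use std::collections::HashSet;

// Hypothetical sketch: while recording the reads of a single task, count how
// many incoming read-edges duplicate one already seen for that task.
struct TaskReads {
    seen: HashSet<u32>,
    reads: Vec<u32>,
    duplicate_count: usize,
}

impl TaskReads {
    fn new() -> Self {
        TaskReads { seen: HashSet::new(), reads: Vec::new(), duplicate_count: 0 }
    }

    fn record_read(&mut self, node: u32) {
        // HashSet::insert returns false if the value was already present.
        if self.seen.insert(node) {
            self.reads.push(node);
        } else {
            self.duplicate_count += 1;
        }
    }

    // Fraction of all recorded reads that were duplicates.
    fn duplicate_ratio(&self) -> f64 {
        let total = self.reads.len() + self.duplicate_count;
        if total == 0 { 0.0 } else { self.duplicate_count as f64 / total as f64 }
    }
}

fn main() {
    let mut task = TaskReads::new();
    for &n in &[1, 2, 1, 3, 2] {
        task.record_read(n);
    }
    println!("duplicates: {} ({:.1}%)", task.duplicate_count, task.duplicate_ratio() * 100.0);
}
```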
Huh, this is very interesting. Thank you so much for collecting this data. It goes to show again that one should always measure before assuming anything. The data collection in your branch looks correct, except for one thing: for anonymous nodes we can't really make deduplication delayed, so counting them too will skew the assessment of what effect the proposed optimization would have. Sorry for not mentioning this earlier. Would you mind adapting your code so that it uses a separate counter for anonymous nodes? I'm not going to make a prediction on how this will affect the numbers.
Reran the same tests with the updated code. The results aren't significantly different (the above "Full results here" link now points to the updated results).
Thanks, @wesleywiser! OK, let's see, I see two options on how to proceed:
I'll let you decide, @wesleywiser.
I'm game to implement delayed deduplication and see how it performs. Do you mind if I go ahead and push up a PR for the stats collection? Also, what's the best way to measure the performance before and after the change? Just use
Cool, I'm really curious how it will do.
No, please go ahead!
I usually use That's for local testing. Once you have a version that you think is optimized enough, you can open a PR and @Mark-Simulacrum will trigger a perf.rust-lang.org measurement for us. That will give us a good idea of what performance will look like.
[incremental] Collect stats about duplicated edge reads from queries (part of #45873)
@michaelwoerister I've tried delaying deduplication until serialization but I'm not seeing much of a difference in compilation time. I've set a If the issue is reallocations, would preallocating the (For reference purposes, my code is in my incr_delay_dedup branch.)
I think the potential for improvement here is about 2% for a debug build with an empty cache. For a rebuild this might actually slow things down, since we'll be deduplicating already-deduplicated vectors.
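The delayed variant being discussed can be sketched as a single deduplication pass run just before serialization, over a plain `Vec` of recorded reads. This is a hypothetical sketch, not the rustc implementation; note that it preserves the order in which each edge was first recorded, which matters for reproducible output.

```rust
use std::collections::HashSet;

// Sketch of delayed deduplication: reads are recorded into a plain Vec with
// no per-task HashSet, and duplicates are filtered out in one pass at
// serialization time, keeping first-seen order.
fn dedup_preserving_order(edges: Vec<u32>) -> Vec<u32> {
    let mut seen = HashSet::with_capacity(edges.len());
    // HashSet::insert returns true only for the first occurrence of a value,
    // so filter keeps exactly the first copy of each edge.
    edges.into_iter().filter(|e| seen.insert(*e)).collect()
}

fn main() {
    let deduped = dedup_preserving_order(vec![1, 2, 1, 3, 2]);
    println!("{:?}", deduped);
}
```

The trade-off noted above shows up here: on a rebuild, `edges` may already contain no duplicates, yet this pass still pays for building the `HashSet`.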
Pretty much, but you have to be careful to measure the correct thing. A few remarks:
There are a few things in our implementation that can be improved:
Thanks, that's really helpful! I implemented your feedback but the results don't look very good. (The source is available in the branch I linked above.)
Well, that's sad. But your implementation looks correct (except for not preserving edge order in the I hope you still found it interesting to try a few things out! I'll make sure to mention your efforts in the next impl period newsletter.
Thanks! One thing I'm left wondering is if the time saved by delaying the deduplication is getting eaten by resizing the
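The resizing concern can be made concrete with a small experiment. This is an illustrative sketch (the capacities and growth behavior are whatever `Vec` does, not what rustc's dep-graph structures do): a vector grown one push at a time reallocates each time its capacity is exhausted, while one preallocated to the final size never does.

```rust
// Count how many pushes would trigger a reallocation when filling a Vec of
// n elements, starting from the given initial capacity.
fn count_reallocations(n: usize, initial_capacity: usize) -> usize {
    let mut v: Vec<u32> = Vec::with_capacity(initial_capacity);
    let mut reallocs = 0;
    for i in 0..n as u32 {
        // When len has reached capacity, the next push must reallocate.
        if v.len() == v.capacity() {
            reallocs += 1;
        }
        v.push(i);
    }
    reallocs
}

fn main() {
    println!("growing from empty: {} reallocations", count_reallocations(1000, 0));
    println!("preallocated:       {} reallocations", count_reallocations(1000, 1000));
}
```

This also illustrates the counterpoint raised below: with a huge number of empty vectors, preallocating a nontrivial capacity for each would waste memory rather than save time.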
If you are up for that, you can certainly try.
Ok! I'll try to collect some data and report back. |
This graph looks like it's missing some axes. Could you upload the raw data? |
@michaelwoerister Given the huge number of empty vectors, I think preallocating would probably cause a huge memory blowup. Unless you have any ideas, I think we can probably close this issue. |
@wesleywiser Yeah. Also, pre-allocating isn't free either. Thanks again for all your work on this!
At the moment the compiler will always eagerly deduplicate read-edges in `CurrentDepGraph::read_index`. In order to be able to do this, the compiler has to allocate a `HashSet` for each task. My suspicion is that there is not much duplication to begin with and that there's potential for optimization here.

So the first step would be to collect some data on how much read-edge duplication there even is. This is most easily done by modifying the compiler to count duplicates (in `librustc/dep_graph/graph.rs`) and print the number/percentage in `-Zincremental-info`. This modified compiler can then be used to compile a number of crates to get an idea of what's going on.

If duplication is low (e.g. less than 2% of reads got filtered out), then we could just remove de-duplication and test the effect with a try-build. Otherwise, we can move deduplication to `DepGraph::serialize()` and measure the performance impact of that.
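The eager scheme described above (a `HashSet` allocated per task inside `read_index`) can be sketched roughly like this. All names and types are illustrative stand-ins for the structures in `librustc/dep_graph/graph.rs`, not the actual implementation:

```rust
use std::collections::HashSet;

// Sketch of eager deduplication: every task carries its own HashSet, so each
// read-edge is filtered at the moment it is recorded.
struct CurrentTask {
    read_set: HashSet<u32>, // one allocation per task: the cost in question
    reads: Vec<u32>,
}

impl CurrentTask {
    fn new() -> Self {
        CurrentTask { read_set: HashSet::new(), reads: Vec::new() }
    }

    // Roughly what an eager read_index does: record an edge only the first
    // time it is seen for this task.
    fn read_index(&mut self, dep_node: u32) {
        if self.read_set.insert(dep_node) {
            self.reads.push(dep_node);
        }
    }
}

fn main() {
    let mut task = CurrentTask::new();
    for &n in &[5, 7, 5, 9, 7, 5] {
        task.read_index(n);
    }
    println!("recorded reads: {:?}", task.reads);
}
```

Removing deduplication entirely would drop `read_set` and push every edge; delaying it would keep the plain `Vec` here and deduplicate once in `DepGraph::serialize()` instead.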