NLL compile-time performance regression #58178
Comments
Thank you for the detailed steps to reproduce.

@ehuss: Using normal profiling tools on a test case this long-running will be painful. Do you have any suggestions for generating a test case that is roughly 10--100x smaller? I guess I could try just removing things from …
I did some ad hoc profiling on a partial run. Comparing a partial histogram of iteration counts for the inner loop with its weighted version, the ratio of the "counts" totals is 4.4, i.e. the inner loop runs 4.4 times on average; although it occasionally gets above 300 iterations, it generally loops only a handful of times.

In contrast, the same pair of histograms for the outer loop gives a "counts" ratio of 8421, i.e. the outer loop runs 8421 times on average, and it gets as high as 51,566 iterations.
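The relationship between the plain and weighted histograms can be made concrete with a small sketch. The sample data and helper names (`histogram`, `average_iterations`) are illustrative, not rustc code: the weighted total divided by the plain total gives the average iterations per invocation.

```rust
use std::collections::BTreeMap;

// Build a histogram mapping iteration-count -> number of invocations.
fn histogram(samples: &[u64]) -> BTreeMap<u64, u64> {
    let mut hist = BTreeMap::new();
    for &s in samples {
        *hist.entry(s).or_insert(0) += 1;
    }
    hist
}

// The "weighted" histogram weights each bucket by its iteration count,
// so total_weight / total_count is the average iterations per invocation.
fn average_iterations(hist: &BTreeMap<u64, u64>) -> f64 {
    let total_count: u64 = hist.values().sum();
    let total_weight: u64 = hist.iter().map(|(&iters, &n)| iters * n).sum();
    total_weight as f64 / total_count as f64
}

fn main() {
    // Hypothetical samples: mostly small counts, one pathological outlier.
    let hist = histogram(&[1, 1, 2, 4, 300]);
    // (1 + 1 + 2 + 4 + 300) / 5 = 61.6
    assert!((average_iterations(&hist) - 61.6).abs() < 1e-9);
    println!("average = {}", average_iterations(&hist));
}
```

This is exactly why the outer loop dominates: a few huge buckets move the weighted total far more than they move the plain count.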
Another data point: the length of …
Thanks for taking a look. It's tricky to create an isolated reproduction. One approach, as you mentioned, is to just use a smaller grammar: delete all the files in …
I did a Cachegrind run using a cut-down version of foo.lyg. It basically confirmed everything I found above, but the cost of traversing the constraint graph is even higher than I expected: Cachegrind attributes roughly 25% and 13% of all executed instructions to two individual lines in that traversal. (It's extremely rare to see 25% or 13% of all instructions attributed to a single line of code!) I can see one or two opportunities to improve things, but they'll only be constant-factor shavings. An algorithmic change will be necessary to make things reasonable, and I don't understand this code well enough to come up with anything myself.
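To illustrate why this traversal dominates, here is a minimal sketch of a breadth-first walk over an outlives-style constraint graph. The types and names are hypothetical simplifications, not rustc's actual data structures: the point is that each query re-walks the graph in O(V + E), so Q queries cost O(Q · (V + E)), which blows up on a 46,000-line generated file.

```rust
use std::collections::VecDeque;

// Hypothetical simplification: regions are indices, constraints are
// directed "outlives" edges between regions.
struct ConstraintGraph {
    edges: Vec<Vec<usize>>, // edges[r] = regions that r must outlive
}

impl ConstraintGraph {
    // Per-query BFS: O(V + E) each call. Calling this once per error
    // or per (region, point) pair is the slow pattern described above.
    fn reachable_from(&self, start: usize) -> Vec<usize> {
        let mut seen = vec![false; self.edges.len()];
        let mut queue = VecDeque::from([start]);
        seen[start] = true;
        let mut out = vec![start];
        while let Some(r) = queue.pop_front() {
            for &next in &self.edges[r] {
                if !seen[next] {
                    seen[next] = true;
                    out.push(next);
                    queue.push_back(next);
                }
            }
        }
        out
    }
}

fn main() {
    let g = ConstraintGraph {
        edges: vec![vec![1, 2], vec![2], vec![]],
    };
    assert_eq!(g.reachable_from(0), vec![0, 1, 2]);
}
```

The inner loop of such a walk is tiny, which is how two source lines can end up carrying a quarter of all executed instructions.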
#58210 is really simple, but it reduces instruction counts by 22% and wall-time by 8% for compilation of my cut-down version of foo.lyg.
@matthewjasper and I have also been looking at this, and removing the call to find_outlives_blame_span goes a long way. For example, stubbing the call out makes the NLL overhead closer to 10% here (but of course simply removing the call worsens the diagnostics). Matthew mentioned they'll be looking at this very soon.
@lqd: could that expensive search be gated behind some size threshold?
@estebank: seems possible to have such a threshold if need be, yeah. But Matthew (who, unlike me, has worked on this part of the code before, and which Niko wrote, IIRC) also mentioned on Zulip that there are other ways to get this diagnostic information. (Oh, and in the little experiment I mentioned earlier, the "worse diagnostics" weren't terrible or anything.)
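The threshold idea could look roughly like the sketch below. Every name and the cutoff value are made up for illustration; rustc's real diagnostics logic differs. The shape is simply: run the expensive blame-span search only on inputs small enough for it to be affordable, and fall back to a cheaper, less precise answer otherwise.

```rust
// Hypothetical cutoff, chosen for illustration only.
const BLAME_SEARCH_THRESHOLD: usize = 10_000;

#[derive(Debug, PartialEq)]
enum BlameSpan {
    Precise(usize), // index of the constraint chosen by the full search
    Fallback,       // cheap degraded diagnostic for pathological inputs
}

// Stand-in for a find_outlives_blame_span-style entry point: gate the
// expensive search behind the constraint-count threshold.
fn blame_span(num_constraints: usize) -> BlameSpan {
    if num_constraints <= BLAME_SEARCH_THRESHOLD {
        // Pretend this is the full (expensive) search.
        BlameSpan::Precise(num_constraints / 2)
    } else {
        BlameSpan::Fallback
    }
}

fn main() {
    assert_eq!(blame_span(100), BlameSpan::Precise(50));
    assert_eq!(blame_span(1_000_000), BlameSpan::Fallback);
}
```

The trade-off is exactly the one discussed above: compile time stays bounded on generated code, at the cost of a vaguer error span in the rare huge cases.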
I've opened #58347, which gets the performance to a much better level (less than 50% slower for a full clean debug build). Once that has landed, we probably want to add the test to perf.rlo and profile what's causing the remaining slowdown.
… r=pnkfelix Closure bounds fixes

* Ensures that "nice region errors" are buffered so that they are sorted and migrated correctly.
* Propagates fewer constraints for closures (cc rust-lang#58178)
* Propagate constraints from closures more precisely (rust-lang#58127)

Closes rust-lang#58127

r? @nikomatsakis
With #58347 merged, this should be much better now. Could @Mark-Simulacrum add this to perf.rlo? And does anyone want to investigate the remaining perf gap?
I probably won't have a chance for a few days, but it should be fairly straightforward to do so: copy the files into collector/benchmarks/grammar and file a PR.
The input size might need tweaking so that the runtime is reasonable.
NLL triage. Seems to me like we still want to investigate the remaining performance gap here. Tagging as P-high.
Also nominating for discussion at the NLL meeting (mostly to see if I can find someone to continue the investigation of the remaining performance gap).
assigning to @csmoe after they pinged me on Zulip
un-nominating based on assumption that @csmoe is going to continue investigating.
I did some investigation on this; it appears that we are still spending too much time on the closure bounds. Lazily determining the best constraint seems like the best way forward here.
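"Lazily determining the best constraint" might be sketched like this. The types are hypothetical stand-ins (the real change would live in rustc's region-inference code): compute the expensive "best constraint" only when a diagnostic actually asks for it, and cache the result so it is computed at most once.

```rust
use std::cell::OnceCell;

// Hypothetical stand-in for a buffered region error.
struct RegionError {
    constraints: Vec<u32>,   // stand-in for the real constraint set
    best: OnceCell<u32>,     // lazily computed "blame" constraint
}

impl RegionError {
    // The expensive search runs at most once, and only if a diagnostic
    // is actually emitted; errors that get filtered out pay nothing.
    fn best_constraint(&self) -> u32 {
        *self.best.get_or_init(|| {
            // Pretend this max() is the costly best-constraint search.
            self.constraints.iter().copied().max().unwrap_or(0)
        })
    }
}

fn main() {
    let err = RegionError {
        constraints: vec![3, 7, 5],
        best: OnceCell::new(),
    };
    assert_eq!(err.best_constraint(), 7);
    assert_eq!(err.best_constraint(), 7); // cached; no second search
}
```

This moves the cost from "every region that might error" to "every error actually reported", which is the asymmetry the comments above are exploiting.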
Downgrading from P-high to P-medium; the bulk of the truly awful performance regression has been addressed. Some issues may remain, but it's not clear if they are worth tracking in this specific issue, so I may close this in the future. Reassigning from @csmoe to self so that I keep track of its status.
After further discussion with @Mark-Simulacrum and @nnethercote, closing as essentially fixed (at least to the point of not being worth tracking in this issue).
NLL seems to cause a severe regression in compile time for code generated by the gll crate.

Steps to reproduce:

1. Run `cargo build`. For me, this takes about 1 minute.
2. Open `src/lib.rs` and add `#![feature(nll)]`.
3. Run `cargo build` again. For me, this takes 77 minutes.

gll generates a very large `parse.rs` file, about 46,000 lines long. (Located in `target/debug/build/rust-grammar-*/out/parse.rs`.)

From what little debugging I did, I see it spending most of its time in a function called `RegionInferenceContext::find_outlives_blame_span`.

Tested: 1.32.0 (stable) and rustc 1.34.0-nightly (f6fac42 2019-02-03).

Please let me know if there's any other information I can provide.

cc @qmx