NLL compile-time performance regression #58178

Closed
ehuss opened this issue Feb 5, 2019 · 21 comments
Labels: A-NLL (Area: Non-lexical lifetimes), NLL-performant (Working towards the "performance is good" goal), P-medium (Medium priority), T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue)

Comments

@ehuss
Contributor

ehuss commented Feb 5, 2019

NLL seems to cause a severe regression in compile time for code generated by the gll crate.

  1. Check out https://github.com/rust-lang-nursery/wg-grammar.git (current hash is e1d072a158a5d1f3d811bbd48bba3608bff1c99f).
  2. cargo build. For me, takes about 1 minute.
  3. Edit src/lib.rs and add #![feature(nll)]
  4. cargo build. For me, takes 77 minutes.

gll generates a very large parse.rs file, about 46,000 lines long. (Located in target/debug/build/rust-grammar-*/out/parse.rs)

From what little debugging I did, I see it spending most of its time in a function called RegionInferenceContext::find_outlives_blame_span.

Tested: 1.32.0 (stable) and rustc 1.34.0-nightly (f6fac42 2019-02-03).

Please let me know if there's any other information I can provide.

cc @qmx

@davidtwco added the A-NLL (Area: Non-lexical lifetimes) and NLL-performant (Working towards the "performance is good" goal) labels Feb 5, 2019
@nnethercote
Contributor

Thank you for the detailed steps to reproduce. I had to use cargo +nightly build, and with that I can confirm the results: 55 seconds for a normal build, 61 minutes for an NLL build.

perf top also agrees that find_outlives_blame_span (or something inlined into it) is the problem.

The generated parse.rs file contains one function (parse) that is 16,000 lines. I wonder if that's the root of the problem.

@ehuss: Using normal profiling tools on a test case this long-running will be painful. Do you have any suggestions for generating a test case that is roughly 10--100x smaller? I guess I could try just removing things from grammar/*.lyg, but perhaps you can give me more targeted ideas?

@nnethercote
Contributor

I did some ad hoc profiling on a partial run and found that find_constraint_paths_between_regions is the critical function. It has an outer loop that is hot and an inner loop that is even hotter.
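
To make the shape of the hot code concrete, here is a minimal sketch (my reconstruction for illustration; not the actual rustc source) of the breadth-first walk that find_constraint_paths_between_regions performs over the outlives-constraint graph. The worklist drain is the outer loop measured below; the per-region edge scan is the inner loop:

```rust
use std::collections::VecDeque;

// Same shape as rustc's outlives constraints: `sup: sub`, by region index.
#[derive(Clone, Copy)]
struct OutlivesConstraint {
    sup: usize,
    sub: usize,
}

#[derive(Clone, Copy)]
enum Trace {
    NotVisited,
    StartRegion,
    FromOutlivesConstraint(OutlivesConstraint),
}

fn find_paths_from(
    num_regions: usize,
    outgoing: &[Vec<OutlivesConstraint>], // outgoing constraints per region
    from: usize,
) -> Vec<Trace> {
    // One slot per region definition; ~64k of these for the huge `parse` fn.
    let mut context = vec![Trace::NotVisited; num_regions];
    context[from] = Trace::StartRegion;
    let mut deque = VecDeque::new();
    deque.push_back(from);

    // Outer loop: runs once per region reached from `from`.
    while let Some(r) = deque.pop_front() {
        // Inner loop: runs once per outgoing constraint of `r`.
        for &constraint in &outgoing[r] {
            debug_assert_eq!(constraint.sup, r);
            if let Trace::NotVisited = context[constraint.sub] {
                context[constraint.sub] = Trace::FromOutlivesConstraint(constraint);
                deque.push_back(constraint.sub);
            }
        }
    }
    context
}
```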

Here is a partial histogram showing the distribution of loop counts for the inner loop:

260467026 counts:
(  1) 78981836 (30.3%, 30.3%): n2 = 2
(  2) 78236479 (30.0%, 60.4%): n2 = 1
(  3) 24332897 ( 9.3%, 69.7%): n2 = 4
(  4) 22356356 ( 8.6%, 78.3%): n2 = 3
(  5) 13480529 ( 5.2%, 83.5%): n2 = 5
(  6) 10596430 ( 4.1%, 87.5%): n2 = 8
(  7)  8426562 ( 3.2%, 90.8%): n2 = 6
(  8)  5604005 ( 2.2%, 92.9%): n2 = 19
(  9)  5547737 ( 2.1%, 95.0%): n2 = 9
( 10)  2602505 ( 1.0%, 96.0%): n2 = 12

Here is the weighted version of the same histogram:

1139707263 counts:
(  1) 157963672 (13.9%, 13.9%): n2 = 2
(  2) 106476095 ( 9.3%, 23.2%): n2 = 19
(  3) 97331588 ( 8.5%, 31.7%): n2 = 4
(  4) 84771440 ( 7.4%, 39.2%): n2 = 8
(  5) 78844077 ( 6.9%, 46.1%): n2 = 37
(  6) 78236479 ( 6.9%, 53.0%): n2 = 1
(  7) 67402645 ( 5.9%, 58.9%): n2 = 5
(  8) 67069068 ( 5.9%, 64.8%): n2 = 3
(  9) 50559372 ( 4.4%, 69.2%): n2 = 6
( 10) 49929633 ( 4.4%, 73.6%): n2 = 9

The ratio of the "counts" totals in the two histograms is 4.4, i.e. the average inner loop count is 4.4; although it sometimes (not shown in the above data) gets above 300, it generally loops a handful of times.

I also found that the Trace::NotVisited = context[sub_region] condition within the inner loop is only true about 20% of the time.

In contrast, here is a partial histogram showing the distribution of loop counts for the outer loop:

30929 counts:
(  1)      287 ( 0.9%,  0.9%): n1a = 7
(  2)      273 ( 0.9%,  1.8%): n1a = 2
(  3)      239 ( 0.8%,  2.6%): n1a = 5
(  4)      210 ( 0.7%,  3.3%): n1a = 14
(  5)      209 ( 0.7%,  3.9%): n1a = 9
(  6)      198 ( 0.6%,  4.6%): n1a = 8
(  7)      197 ( 0.6%,  5.2%): n1a = 10
(  8)      189 ( 0.6%,  5.8%): n1a = 12
(  9)      180 ( 0.6%,  6.4%): n1a = 11
( 10)      171 ( 0.6%,  7.0%): n1a = 6
( 11)      158 ( 0.5%,  7.5%): n1a = 13
( 12)      127 ( 0.4%,  7.9%): n1a = 660
( 13)      125 ( 0.4%,  8.3%): n1a = 642
( 14)      124 ( 0.4%,  8.7%): n1a = 624
( 15)      124 ( 0.4%,  9.1%): n1a = 152
( 16)      120 ( 0.4%,  9.5%): n1a = 46
( 17)      118 ( 0.4%,  9.9%): n1a = 141
( 18)      113 ( 0.4%, 10.2%): n1a = 674
( 19)      110 ( 0.4%, 10.6%): n1a = 48
( 20)      108 ( 0.3%, 10.9%): n1a = 44

Here is the weighted version of the same histogram:

260475188 counts:
(  1)   567226 ( 0.2%,  0.2%): n1a = 51566
(  2)   567028 ( 0.2%,  0.4%): n1a = 51548
(  3)   566830 ( 0.2%,  0.7%): n1a = 51530
(  4)   566632 ( 0.2%,  0.9%): n1a = 51512
(  5)   566434 ( 0.2%,  1.1%): n1a = 51494
(  6)   566236 ( 0.2%,  1.3%): n1a = 51476
(  7)   565598 ( 0.2%,  1.5%): n1a = 51418
(  8)   565400 ( 0.2%,  1.7%): n1a = 51400
(  9)   565202 ( 0.2%,  2.0%): n1a = 51382
( 10)   565004 ( 0.2%,  2.2%): n1a = 51364
( 11)   564806 ( 0.2%,  2.4%): n1a = 51346
( 12)   564608 ( 0.2%,  2.6%): n1a = 51328
( 13)   563761 ( 0.2%,  2.8%): n1a = 51251
( 14)   563563 ( 0.2%,  3.0%): n1a = 51233
( 15)   563365 ( 0.2%,  3.3%): n1a = 51215
( 16)   562188 ( 0.2%,  3.5%): n1a = 51108
( 17)   561990 ( 0.2%,  3.7%): n1a = 51090
( 18)   561792 ( 0.2%,  3.9%): n1a = 51072
( 19)   561462 ( 0.2%,  4.1%): n1a = 51042
( 20)   561264 ( 0.2%,  4.3%): n1a = 51024

The ratio of the "counts" totals in the two histograms is 8421, i.e. the average outer loop count is 8421, and it gets as high as 51566.

@nnethercote
Contributor

nnethercote commented Feb 6, 2019

Another data point: the length of context in find_constraint_paths_between_regions (i.e. the number of definitions) is 63577 in the dominant case. I guess this is probably coming from the huge parse function. And then we're calling find_constraint_paths_between_regions lots of times for parse, so it's basically quadratic.
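
As a rough consistency check, combining the totals from the histograms above (my arithmetic):

```
 30,929 BFS calls × 8,421 avg outer iterations ≈ 260,475,188 region visits
260M region visits × 4.4 avg inner iterations  ≈ 1,139,707,263 constraint visits
```

That is, tens of thousands of searches, each walking a graph with tens of thousands of regions: calls × regions is where the quadratic behaviour comes from.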

@ehuss
Contributor Author

ehuss commented Feb 6, 2019

Thanks for taking a look. It's tricky to create an isolated reproduction. One approach, as you mentioned, is to just use a smaller grammar. Delete all the files in grammar/ and create a new one (name it foo.lyg or whatever) with these contents: foo.lyg. (You can delete more lines if that is still too slow.) For me, the non-NLL build is 2s and the NLL build is ~70s.

@nnethercote
Contributor

nnethercote commented Feb 6, 2019

I did a Cachegrind run using a cut-down version of foo.lyg. It basically confirmed everything I found above. The cost of traversing the constraint graph is a bit higher than I expected, as these lines show:

            .           impl<'s, D: ConstraintGraphDirecton> Iterator for Edges<'s, D> {
            .               type Item = OutlivesConstraint;
            .               
            .               fn next(&mut self) -> Option<Self::Item> {
1,771,873,465 ( 2.50%)          if let Some(p) = self.pointer {
  708,321,207 ( 1.00%)              self.pointer = self.graph.next_constraints[p];
            .            
9,208,629,436 (13.00%)              Some(self.constraints[p])
  354,512,802 ( 0.50%)          } else if let Some(next_static_idx) = self.next_static_idx {
            .                       self.next_static_idx =
      983,892 ( 0.00%)                  if next_static_idx == (self.graph.first_constraints.len() - 1) {
            .                               None 
            .                           } else {
            .                               Some(next_static_idx + 1)
            .                           };

and

17,710,114,719 (25.00%)              for constraint in self.constraint_graph
             .                           .outgoing_edges(r, &self.constraints, fr_static)
             .                       {
 3,542,079,795 ( 5.00%)                  assert_eq!(constraint.sup, r);
             .                           let sub_region = constraint.sub;
 2,125,247,877 ( 3.00%)                  if let Trace::NotVisited = context[sub_region] {
 1,302,130,110 ( 1.84%)                      context[sub_region] = Trace::FromOutlivesConstraint(constraint);
             .                               deque.push_back(sub_region);
             .                           }
             .                       }

(It's extremely rare to see 25% or 13% of all instructions attributed to a single line of code!)

I can see one or two opportunities to improve things, but they'll only be constant-time shavings. An algorithmic change will be necessary to make things reasonable, and I don't understand this code well enough to come up with anything myself.
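
For what it's worth, one example of such a constant-factor shaving (my illustration only; not necessarily what any actual PR did): the hot Some(self.constraints[p]) line above comes from chasing the first_constraints/next_constraints linked lists, and a compressed-sparse-row (CSR) layout would make each region's outgoing constraints a contiguous, cache-friendly slice:

```rust
// Reuses the `OutlivesConstraint { sup, sub }` shape from the sketch above.
#[derive(Clone, Copy)]
struct OutlivesConstraint {
    sup: usize,
    sub: usize,
}

struct CsrGraph {
    // edge_starts[r]..edge_starts[r + 1] delimits region r's edges.
    edge_starts: Vec<usize>,
    // All constraints, grouped by `sup` region.
    edges: Vec<OutlivesConstraint>,
}

impl CsrGraph {
    fn build(num_regions: usize, constraints: &[OutlivesConstraint]) -> Self {
        // Count outgoing edges per region...
        let mut edge_starts = vec![0usize; num_regions + 1];
        for c in constraints {
            edge_starts[c.sup + 1] += 1;
        }
        // ...prefix-sum the counts into start offsets...
        for i in 1..edge_starts.len() {
            edge_starts[i] += edge_starts[i - 1];
        }
        // ...then scatter each constraint into its region's bucket.
        let mut edges = vec![OutlivesConstraint { sup: 0, sub: 0 }; constraints.len()];
        let mut next = edge_starts.clone();
        for &c in constraints {
            edges[next[c.sup]] = c;
            next[c.sup] += 1;
        }
        CsrGraph { edge_starts, edges }
    }

    // The BFS inner loop becomes a plain slice scan, with no pointer chasing.
    fn outgoing_edges(&self, r: usize) -> &[OutlivesConstraint] {
        &self.edges[self.edge_starts[r]..self.edge_starts[r + 1]]
    }
}
```

The trade-off is a one-time O(edges) build once the constraint set stops changing; it doesn't change the algorithmic complexity, which is why an algorithmic fix would still be needed.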

@nnethercote
Contributor

#58210 is really simple but reduces instruction counts by 22% and wall-time by 8% for compilation of my cut-down version of foo.lyg.

@lqd
Member

lqd commented Feb 6, 2019

@matthewjasper and I have also been looking at this, and removing the call to find_outlives_blame_span here shows it's indeed too costly a function to be called thousands of times (though IIUC that should only happen rarely: when checking huge closures like the ones here; I've seen one in the logs with 70860 region definitions).

For example, stubbing the call out makes the NLL overhead closer to 10% here (but of course simply removing the call worsens the diagnostics).

Matthew mentioned they'll be looking at this very soon.

@matthewjasper self-assigned this Feb 6, 2019
@estebank
Contributor

estebank commented Feb 7, 2019

@lqd could find_outlives_blame_span be called optionally if we could verify ahead of time what the search space is? That way for the common case we still get the nicer error messages with a reasonable level of slowness, while keeping the running time constrained in the worst cases.
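
A minimal sketch of that gating idea (hypothetical names and signature, mine, not rustc's):

```rust
/// Run the precise-but-expensive blame search only when the search space is
/// small enough; otherwise fall back to a cheap, coarser answer.
fn bounded_blame<T>(
    search_space: usize,          // e.g. number of region definitions
    limit: usize,                 // cutoff to be tuned against real crates
    precise: impl FnOnce() -> T,  // e.g. the full find_outlives_blame_span walk
    fallback: impl FnOnce() -> T, // e.g. blame the whole closure body
) -> T {
    if search_space <= limit {
        precise()
    } else {
        fallback()
    }
}
```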

@lqd
Member

lqd commented Feb 7, 2019

@estebank seems possible to have such a threshold if need be, yeah. But Matthew (who, unlike me, has worked on this part of the code before; Niko wrote it IIRC) also mentioned on Zulip that there are other ways to get this diagnostic information.

(Oh and in the little experiment I mentioned earlier, the "worse diagnostics" weren't terrible or anything)

@matthewjasper
Contributor

matthewjasper commented Feb 10, 2019

I've opened #58347, which gets performance to a much better level (less than 50% slower for a full clean debug build). Once that has landed, we probably want to add the test to perf.rlo and profile what's causing the remaining slowdown.

Centril added a commit to Centril/rust that referenced this issue Feb 14, 2019
… r=pnkfelix

Closure bounds fixes

* Ensures that "nice region errors" are buffered so that they are sorted and migrated correctly.
* Propagates fewer constraints for closures (cc rust-lang#58178)
* Propagate constraints from closures more precisely (rust-lang#58127)

Closes rust-lang#58127

r? @nikomatsakis
@matthewjasper removed their assignment Feb 14, 2019
@matthewjasper
Contributor

matthewjasper commented Feb 14, 2019

With #58347 merged, this should be much better now. Could @Mark-Simulacrum add this to perf.rlo? And does anyone want to investigate the remaining perf gap?

@Mark-Simulacrum
Member

I probably won't have a chance for a few days, but it should be fairly straightforward to do: copy the files into collector/benchmarks/grammar and file a PR.

@nnethercote
Contributor

The input size might need tweaking so that the runtime is reasonable.

@pnkfelix
Member

pnkfelix commented Mar 6, 2019

NLL triage. Seems to me like we still want to investigate the remaining performance gap here. Tagging as P-high.

@pnkfelix added the P-high (High priority) label Mar 6, 2019
@pnkfelix
Member

pnkfelix commented Mar 6, 2019

Also nominating for discussion at the NLL meeting (mostly to see if I can find someone to continue the investigation of the remaining performance gap).

@pnkfelix
Member

pnkfelix commented Mar 6, 2019

Assigning to @csmoe after they pinged me on Zulip.

@pnkfelix
Member

Un-nominating based on the assumption that @csmoe is going to continue investigating.

@matthewjasper
Contributor

I did some investigation on this; it appears we are still spending too much time on the closure bounds. Lazily determining the best constraint seems like the best way forward here.
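
A minimal sketch of what lazy blame determination could look like (my illustration with hypothetical names; std's OnceCell didn't exist in 2019, when this would have been the once_cell crate or a manual Option):

```rust
use std::cell::OnceCell;

/// A propagated closure bound that carries enough information to find its
/// blame constraint, but only runs the expensive search if the bound
/// actually surfaces in an error report.
struct PropagatedBound {
    sup: usize,
    sub: usize,
    cached_blame: OnceCell<String>,
}

impl PropagatedBound {
    fn blame(&self, expensive_search: impl FnOnce(usize, usize) -> String) -> &str {
        // `expensive_search` runs at most once, and only on the error path.
        self.cached_blame.get_or_init(|| expensive_search(self.sup, self.sub))
    }
}
```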

@Centril added the T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue) label May 29, 2019
@pnkfelix
Member

Downgrading from P-high to P-medium; the bulk of the truly awful performance regression has been addressed.

Some issues may remain, but it's not clear whether they are worth tracking in this specific issue, so I may close this in the future.

Reassigning from @csmoe to myself so that I keep track of its status.

@pnkfelix self-assigned this and unassigned @csmoe Sep 12, 2019
@pnkfelix added the P-medium (Medium priority) label and removed the P-high (High priority) label Sep 12, 2019
@pnkfelix
Member

After further discussion with @Mark-Simulacrum and @nnethercote, closing as essentially fixed (at least to the point of not being worth tracking in this issue).
