
Upstream from RelationalAI: Salsa performance improvements: Remove allocations #27

Merged (22 commits, Jul 13, 2020)

Conversation

NHDaly (Collaborator) commented Jul 13, 2020

Upstream from RelationalAI

Use the benchmarks added in #23 to improve salsa perf.

Ultimately, most of this work turned out to be finding and removing allocations in the DefaultStorage implementation. Some of those allocations were from type instability, which I was able to fix via some small silly things. A lot of the improvement came from implementing the TODO optimization ideas, which reduced allocations by being smarter about reusing existing instances of DerivedValue, and the vectors contained in them.


Here are the latest results from master on the Salsa benchmarks, introduced in #23:

| name | value | unit |
| --- | --- | --- |
| default_storage.simple_bench:N=10,iters=1000.allocs | 513366 | |
| default_storage.simple_bench:N=10,iters=1000.bytes | 12853824 | bytes |
| default_storage.simple_bench:N=10,iters=1000.gc_time | 0 | s |
| default_storage.simple_bench:N=10,iters=1000.time | 0.057029926 | s |

And here are the results from the latest build on this branch:

| name | value | unit |
| --- | --- | --- |
| default_storage.simple_bench:N=10,iters=1000.allocs | 143404 | |
| default_storage.simple_bench:N=10,iters=1000.bytes | 3108672 | bytes |
| default_storage.simple_bench:N=10,iters=1000.gc_time | 0 | s |
| default_storage.simple_bench:N=10,iters=1000.time | 0.032309521 | s |

NHDaly and others added 22 commits July 12, 2020 23:03
_Slightly_ reduces allocations for simple benchmark
Add type assertion in get_map_for_key, where apparently it can't deduce
the type returned by `get!()`, and reduce allocations with a known key
from 4 -> 2 and an unknown key from 11 -> 9! 😬
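The kind of type assertion described above can be sketched as follows. This is a hypothetical example, not Salsa's actual internals: when a Dict's value type is abstract, `get!` infers as that abstract type and downstream code allocates; asserting the concrete type restores inference for callers.

```julia
# Hypothetical sketch: `maps` stands in for Salsa's per-key map storage.
const maps = Dict{Symbol,Any}()

function get_map_for_key(key::Symbol)
    # `get!` with a do-block returns the stored (or newly inserted) value,
    # but since the Dict's value type is Any, inference sees `Any` here.
    m = get!(maps, key) do
        Dict{Int,Int}()
    end
    # The assertion narrows the return type back to a concrete Dict,
    # so callers don't box every result.
    return m::Dict{Int,Int}
end
```

The same map is returned on repeated calls, so callers can mutate it in place: `get_map_for_key(:cache)[1] = 2`.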
Since we need to keep a count of all active derived functions (only for
the assertions), we can reduce lock contention by making this an Atomic
instead of locking on every access.
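A lock-free counter of this shape might look like the sketch below (names like `n_active` and `with_derived_call` are illustrative, not Salsa's API): an atomic read-modify-write replaces acquiring a lock on every entry and exit.

```julia
using Base.Threads: Atomic, atomic_add!

# Hypothetical count of currently-active derived function calls.
const n_active = Atomic{Int}(0)

function with_derived_call(f)
    atomic_add!(n_active, 1)      # atomic increment, no lock contention
    try
        return f()
    finally
        atomic_add!(n_active, -1) # always decrement, even if f() throws
    end
end
```

Since the count is only consulted by assertions, the relaxed guarantees of an atomic (versus a lock protecting larger state) are sufficient here.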
This prevents array growth events on the preallocated traces, since we expect them to receive a nonzero number of elements.
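The reservation is a plain `sizehint!`; a minimal sketch (the capacity of 16 is an illustrative guess, not Salsa's actual value):

```julia
# Reserve capacity up front so pushes never trigger a growth/reallocation event.
trace = Int[]
sizehint!(trace, 16)   # one allocation now instead of several doublings later

for dep in 1:10
    push!(trace, dep)  # stays within the reserved capacity
end
```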
Since the Salsa stack traces are implemented via a linked-list (in order
to support branching for calls on new threads), they necessarily
allocate (2 allocs) on each `@derived function` call. HOWEVER, it's
worth noting that this allocation is _extremely_ cheap. So we might
consider re-enabling this, even in debug mode, since it really is nice
to do early cycle detection and pretty stack trace printing.
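A linked-list trace of the shape described above can be sketched like this (all names are hypothetical stand-ins for Salsa's internals): each derived call pushes one immutable node pointing at its parent, which is what makes branching onto a new thread cheap, and cycle detection is a walk up the parent chain.

```julia
# One node per in-flight derived call; the `parent` link forms the call stack.
struct TraceNode
    key::Symbol                         # stand-in for the derived-function key
    parent::Union{TraceNode,Nothing}    # `nothing` at the root of the stack
end

# Entering a derived call allocates exactly one small node.
push_frame(parent, key::Symbol) = TraceNode(key, parent)

# Early cycle detection: is `key` already somewhere up the chain?
function in_trace(node::Union{TraceNode,Nothing}, key::Symbol)
    while node !== nothing
        node.key == key && return true
        node = node.parent
    end
    return false
end
```

Because nodes are immutable and only hold a parent pointer, two threads can branch from the same parent node without copying the trace.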
Somehow, just the presence of the type annotation with a `where T`
caused allocations, that disappear when this is removed. I'm not at all
sure why.
Reduce allocs and time further by typing the Derived functions map with
the expected, computed return type

Interestingly, this still has 7 allocations when storing a new derived
value, whereas the alternative approach of switching to a single giant
dict has only 3, but this is still faster.
This also improves performance by reducing the time spent constructing
types at runtime, which is surprisingly expensive! :)
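Typing the cache by the computed return type might look like the following sketch (the `DerivedValue` field layout and `make_cache` helper are illustrative): the point is that lookups yield a concretely-typed `DerivedValue{T}`, so reading `.value` needs no dynamic check.

```julia
# Illustrative cached result wrapper, parameterized on the return type.
struct DerivedValue{T}
    value::T
end

# One concretely-typed Dict per derived function, keyed by the call arguments.
make_cache(::Type{T}) where {T} = Dict{Tuple,DerivedValue{T}}()

cache = make_cache(Int)          # cache for a derived function returning Int
cache[(1,)] = DerivedValue(2)    # store the result for arguments (1,)
```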

This is a simplification we've wanted to do for a while.
In new get_user_function: fully qualify dispatch key type, so we're not
overwriting methods.
For newly simplified key types: Fix `==` and `<` to return false for
keys with different function types.
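The comparison fix described above can be sketched as follows (the `DerivedKey` name and layout are illustrative): keys carry their function type as a parameter, equal function types compare by arguments, and mismatched function types simply compare as unequal rather than erroring.

```julia
# Illustrative key type: F is the derived function's type, Args the argument tuple type.
struct DerivedKey{F,Args}
    args::Args
end

# Same function type: compare the arguments.
Base.:(==)(a::DerivedKey{F}, b::DerivedKey{F}) where {F} = a.args == b.args
# Different function types: never equal (more specific method above wins otherwise).
Base.:(==)(::DerivedKey, ::DerivedKey) = false

Base.isless(a::DerivedKey{F}, b::DerivedKey{F}) where {F} = isless(a.args, b.args)
Base.isless(::DerivedKey, ::DerivedKey) = false
```

Note that returning `false` from `isless` for keys of different function types matches the commit's description, at the cost of not defining a total order across function types.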
It seems to run forever in Delve trying to compute the return types of
the user's code. I'm hopeful that the one at the top level should be
okay, but we'll see.
I'm not entirely sure why. I think at least some of them come from the keys always being abstract types, whereas in the old code they were always fully typed. And I think some of them come from Base.ht_keyindex, which is intermittently re-indexing the whole dict, and I'm not sure why. In my simple test the runtime only ever sees one derived function, so I'd think the dicts should be the same size with and without this commit... Very interesting.

Before:
```julia
julia> let rt2 = Runtime()  # 3 allocs per new isbits value stored
           @time for i in 1:1_000_000; derived_add1(rt2, i) end
       end
  0.892621 seconds (3.00 M allocations: 202.491 MiB)

julia> let rt2 = Runtime()  # 4 allocs per new non-isbits value stored
           @time for i in 1:100_000;
               derived_length(rt2, Int[i])
           end
       end
  0.149560 seconds (500.04 k allocations: 30.083 MiB)
```

After:
```julia
julia> let rt2 = Runtime()  # Less predictable performance, and maybe 9ish allocs per value?
           @time @profile for i in 1:1_000_000; derived_add1(rt2, i) end
       end
  3.323522 seconds (9.42 M allocations: 300.486 MiB)

julia> let rt2 = Runtime()  # Here, only 6ish per value? Less predictable, for sure.
           @time for i in 1:100_000;
               derived_length(rt2, Int[i])
           end
       end
  0.144503 seconds (615.98 k allocations: 31.852 MiB)
```
…ore allocs"

This reverts commit 113404038b016a6efa124aa37bd6d544ed75e276.
We added return_type inference to make derived functions type stable,
since julia of course has no way to know what types of values will be
found in the salsa caches.

However, this appears to never terminate (or at least, times out after
an hour on the build farm), so this commit disables it. We'll have to
rely on user-specified return type annotations for now, despite the
negative effect those can have on performance.
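The inference query being disabled is presumably a call of this shape; a minimal sketch of asking the compiler for a function's return type, which is what can become pathologically slow on large user code:

```julia
# Ask Julia's compiler what `f` returns when called with an Int.
# On big codebases this inference query can be extremely slow, which is why
# the commit falls back to user-written return type annotations instead.
f(x) = x + 1
RT = Core.Compiler.return_type(f, Tuple{Int})
```

For a trivial function like this the query is fast and returns `Int`; the cost grows with the amount of code inference has to traverse.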
This commit increases reuse of objects in two places to reduce allocations:
1. Even if the value has changed, if we already have an existing_value,
   we can reuse it and just modify its fields, rather than creating a new
   DerivedValue instance.
2. We can skip allocating another vector for dependency tracking by swapping
   the trace vector with that of the existing_value: since we're going to
   reuse it anyway, the trace is written directly in place, and we don't need
   to copy the results out and then clear the trace vector. :)
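Both reuses above can be sketched together (the `DerivedVal` layout and `record!` helper are hypothetical stand-ins for Salsa's storage code):

```julia
# Illustrative mutable cached-result type.
mutable struct DerivedVal
    value::Int
    deps::Vector{Int}
end

function record!(existing::DerivedVal, new_value::Int, trace::Vector{Int})
    # (1) Mutate the existing instance's fields instead of allocating
    #     a fresh DerivedVal for the changed value.
    existing.value = new_value
    # (2) Swap the dependency vector with the trace vector: the trace's
    #     contents become the stored deps in place, and the old deps vector
    #     is emptied and handed back to serve as the next trace buffer.
    existing.deps, trace = trace, existing.deps
    empty!(trace)
    return trace   # caller keeps using this as the preallocated trace
end
```

The swap means no vector is copied and no new vector is allocated on this path; the two buffers just trade roles on each recomputation.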
…alid check.

Given very wide derived function with lots of dependencies:
```julia
julia> Salsa.@derived derived_add1(rt,x) = (x+1)
derived_add1 (generic function with 1 method)

julia> @derived f(rt, a, b) = sum(derived_add1(rt, i) for i in a:b)
f (generic function with 1 method)
```

Before:
```julia
julia> let rt = Runtime()
           @time for i in 1:100
               set_x!(rt, i)  # unused, but triggers still_valid check
               f(rt, 1, 1000)
           end
       end
  0.076925 seconds (394.98 k allocations: 7.729 MiB)
```

After:
```julia
julia> let rt = Runtime()
           @time for i in 1:100
               set_x!(rt, i)  # unused, but triggers still_valid check
               f(rt, 1, 1000)
           end
       end
  0.068495 seconds (106.39 k allocations: 3.362 MiB, 12.79% gc time)
```
It's causing a StackOverflow when run with -O0 on our CI
@NHDaly NHDaly added performance CPU, RAM, disk utilization and efficiency upstream labels Jul 13, 2020
@NHDaly NHDaly merged commit 820ae69 into master Jul 13, 2020
NHDaly added a commit that referenced this pull request Jul 15, 2020
The changes in #27 changed the way we represent keys, and I forgot to
update the Inspect code to still be able to print a graph.