
Upstream from RelationalAI: Salsa performance improvements: Remove allocations #27

Merged (22 commits, Jul 13, 2020)

Conversation

NHDaly (Collaborator) commented Jul 13, 2020

Upstream from RelationalAI

Use the benchmarks added in #23 to improve salsa perf.

Ultimately, most of this work turned out to be finding and removing allocations in the DefaultStorage implementation. Some of those allocations were from type instability, which I was able to fix via some small silly things. A lot of the improvement came from implementing the TODO optimization ideas, which reduced allocations by being smarter about reusing existing instances of DerivedValue, and the vectors contained in them.


Here are the latest results from master on the Salsa benchmarks, introduced in #23:

| name | value | unit |
| --- | --- | --- |
| default_storage.simple_bench:N=10,iters=1000.allocs | 513366 | |
| default_storage.simple_bench:N=10,iters=1000.bytes | 12853824 | bytes |
| default_storage.simple_bench:N=10,iters=1000.gc_time | 0 | s |
| default_storage.simple_bench:N=10,iters=1000.time | 0.057029926 | s |

And here are the results from the latest build on this branch:

| name | value | unit |
| --- | --- | --- |
| default_storage.simple_bench:N=10,iters=1000.allocs | 143404 | |
| default_storage.simple_bench:N=10,iters=1000.bytes | 3108672 | bytes |
| default_storage.simple_bench:N=10,iters=1000.gc_time | 0 | s |
| default_storage.simple_bench:N=10,iters=1000.time | 0.032309521 | s |

NHDaly and others added 22 commits July 12, 2020 23:03
_Slightly_ reduces allocations for simple benchmark
Add type assertion in get_map_for_key, where apparently it can't deduce
the type returned by `get!()`, and reduce allocations with a known key
from 4 -> 2 and an unknown key from 11 -> 9! 😬
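The kind of type assertion described above can be sketched as follows. This is a hypothetical example, not Salsa's actual internals: when a Dict's value type is abstract, `get!` infers as that abstract type and downstream code allocates; asserting the concrete type restores inference for callers.

```julia
# Hypothetical sketch: `maps` stands in for Salsa's per-key map storage.
const maps = Dict{Symbol,Any}()

function get_map_for_key(key::Symbol)
    # `get!` with a do-block returns the stored (or newly inserted) value,
    # but since the Dict's value type is Any, inference sees `Any` here.
    m = get!(maps, key) do
        Dict{Int,Int}()
    end
    # The assertion narrows the return type back to a concrete Dict,
    # so callers don't box every result.
    return m::Dict{Int,Int}
end
```

The same map is returned on repeated calls, so callers can mutate it in place: `get_map_for_key(:cache)[1] = 2`.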
Since we need to keep a count of all active derived functions (only for
the assertions), we can reduce lock contention by making this an Atomic
instead of locking on every access.
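A lock-free counter of this shape might look like the sketch below (names like `n_active` and `with_derived_call` are illustrative, not Salsa's API): an atomic read-modify-write replaces acquiring a lock on every entry and exit.

```julia
using Base.Threads: Atomic, atomic_add!

# Hypothetical count of currently-active derived function calls.
const n_active = Atomic{Int}(0)

function with_derived_call(f)
    atomic_add!(n_active, 1)      # atomic increment, no lock contention
    try
        return f()
    finally
        atomic_add!(n_active, -1) # always decrement, even if f() throws
    end
end
```

Since the count is only consulted by assertions, the relaxed guarantees of an atomic (versus a lock protecting larger state) are sufficient here.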
This prevents array growth events on the preallocated traces, since we expect them to receive a nonzero number of elements.
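The reservation is a plain `sizehint!`; a minimal sketch (the capacity of 16 is an illustrative guess, not Salsa's actual value):

```julia
# Reserve capacity up front so pushes never trigger a growth/reallocation event.
trace = Int[]
sizehint!(trace, 16)   # one allocation now instead of several doublings later

for dep in 1:10
    push!(trace, dep)  # stays within the reserved capacity
end
```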
Since the Salsa stack traces are implemented via a linked-list (in order
to support branching for calls on new threads), they necessarily
allocate (2 allocs) on each `@derived function` call. HOWEVER, it's
worth noting that this allocation is _extremely_ cheap. So we might
consider re-enabling this, even in debug mode, since it really is nice
to do early cycle detection and pretty stack trace printing.
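A linked-list trace of the shape described above can be sketched like this (all names are hypothetical stand-ins for Salsa's internals): each derived call pushes one immutable node pointing at its parent, which is what makes branching onto a new thread cheap, and cycle detection is a walk up the parent chain.

```julia
# One node per in-flight derived call; the `parent` link forms the call stack.
struct TraceNode
    key::Symbol                         # stand-in for the derived-function key
    parent::Union{TraceNode,Nothing}    # `nothing` at the root of the stack
end

# Entering a derived call allocates exactly one small node.
push_frame(parent, key::Symbol) = TraceNode(key, parent)

# Early cycle detection: is `key` already somewhere up the chain?
function in_trace(node::Union{TraceNode,Nothing}, key::Symbol)
    while node !== nothing
        node.key == key && return true
        node = node.parent
    end
    return false
end
```

Because nodes are immutable and only hold a parent pointer, two threads can branch from the same parent node without copying the trace.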
Somehow, just the presence of the type annotation with a `where T`
caused allocations, that disappear when this is removed. I'm not at all
sure why.
Reduce allocs and time further by typing the Derived functions map with
the expected, computed return type

Interestingly, this still has 7 allocations when storing a new derived
value, whereas the alternative approach of switching to a single giant
dict has only 3, but this is still faster.
This also improves performance by reducing the time spent constructing
types at runtime, which is surprisingly expensive! :)
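Typing the cache by the computed return type might look like the following sketch (the `DerivedValue` field layout and `make_cache` helper are illustrative): the point is that lookups yield a concretely-typed `DerivedValue{T}`, so reading `.value` needs no dynamic check.

```julia
# Illustrative cached result wrapper, parameterized on the return type.
struct DerivedValue{T}
    value::T
end

# One concretely-typed Dict per derived function, keyed by the call arguments.
make_cache(::Type{T}) where {T} = Dict{Tuple,DerivedValue{T}}()

cache = make_cache(Int)          # cache for a derived function returning Int
cache[(1,)] = DerivedValue(2)    # store the result for arguments (1,)
```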

This is a simplification we've wanted to do for a while.
In new get_user_function: fully qualify dispatch key type, so we're not
overwriting methods.
For newly simplified key types: Fix `==` and `<` to return false for
keys with different function types.
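The comparison fix described above can be sketched as follows (the `DerivedKey` name and layout are illustrative): keys carry their function type as a parameter, equal function types compare by arguments, and mismatched function types simply compare as unequal rather than erroring.

```julia
# Illustrative key type: F is the derived function's type, Args the argument tuple type.
struct DerivedKey{F,Args}
    args::Args
end

# Same function type: compare the arguments.
Base.:(==)(a::DerivedKey{F}, b::DerivedKey{F}) where {F} = a.args == b.args
# Different function types: never equal (more specific method above wins otherwise).
Base.:(==)(::DerivedKey, ::DerivedKey) = false

Base.isless(a::DerivedKey{F}, b::DerivedKey{F}) where {F} = isless(a.args, b.args)
Base.isless(::DerivedKey, ::DerivedKey) = false
```

Note that returning `false` from `isless` for keys of different function types matches the commit's description, at the cost of not defining a total order across function types.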
It seems to run forever in Delve trying to compute the return types of
the user's code. I'm hopeful that the one at the top level should be
okay, but we'll see.
I'm not entirely sure why. I think at least some of them come from the keys always being abstract types, whereas in the old code they were always fully typed. And I think some of them come from Base.ht_keyindex, which is intermittently re-indexing the whole dict, and I'm not sure why. In my simple test the runtime only ever sees one derived function, so I'd think the dicts should be the same size with and without this commit... Very interesting.

Before:
```julia
julia> let rt2 = Runtime()  # 3 allocs per new isbits value stored
           @time for i in 1:1_000_000; derived_add1(rt2, i) end
       end
  0.892621 seconds (3.00 M allocations: 202.491 MiB)

julia> let rt2 = Runtime()  # 4 allocs per new non-isbits value stored
           @time for i in 1:100_000;
               derived_length(rt2, Int[i])
           end
       end
  0.149560 seconds (500.04 k allocations: 30.083 MiB)
```

After:
```julia
julia> let rt2 = Runtime()  # Less predictable performance, and maybe 9ish allocs per value?
           @time @profile for i in 1:1_000_000; derived_add1(rt2, i) end
       end
  3.323522 seconds (9.42 M allocations: 300.486 MiB)

julia> let rt2 = Runtime()  # Here, only 6ish per value? Less predictable, for sure.
           @time for i in 1:100_000;
               derived_length(rt2, Int[i])
           end
       end
  0.144503 seconds (615.98 k allocations: 31.852 MiB)
```
…ore allocs"

This reverts commit 113404038b016a6efa124aa37bd6d544ed75e276.
We added return_type inference to make derived functions type stable,
since julia of course has no way to know what types of values will be
found in the salsa caches.

However, this appears to never terminate (or at least, times out after
an hour on the build farm), so this commit disables it. We'll have to
rely on user-specified return type annotations for now, despite the
negative effect those can have on performance.
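The inference query being disabled is presumably a call of this shape; a minimal sketch of asking the compiler for a function's return type, which is what can become pathologically slow on large user code:

```julia
# Ask Julia's compiler what `f` returns when called with an Int.
# On big codebases this inference query can be extremely slow, which is why
# the commit falls back to user-written return type annotations instead.
f(x) = x + 1
RT = Core.Compiler.return_type(f, Tuple{Int})
```

For a trivial function like this the query is fast and returns `Int`; the cost grows with the amount of code inference has to traverse.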
This commit increases reuse of objects in two places to reduce allocations:
1. Even if the value has changed, if we already have an existing_value,
   we can reuse it and just modify its fields, rather than creating a new
   DerivedValue instance.
2. We can skip allocating another vector for dependency tracking by swapping
   the trace vector with that of the existing_value: since we're going to
   reuse it anyway, the trace is written directly in place, and we don't need
   to copy the results out and then clear the trace vector. :)
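Both reuses above can be sketched together (the `DerivedVal` layout and `record!` helper are hypothetical stand-ins for Salsa's storage code):

```julia
# Illustrative mutable cached-result type.
mutable struct DerivedVal
    value::Int
    deps::Vector{Int}
end

function record!(existing::DerivedVal, new_value::Int, trace::Vector{Int})
    # (1) Mutate the existing instance's fields instead of allocating
    #     a fresh DerivedVal for the changed value.
    existing.value = new_value
    # (2) Swap the dependency vector with the trace vector: the trace's
    #     contents become the stored deps in place, and the old deps vector
    #     is emptied and handed back to serve as the next trace buffer.
    existing.deps, trace = trace, existing.deps
    empty!(trace)
    return trace   # caller keeps using this as the preallocated trace
end
```

The swap means no vector is copied and no new vector is allocated on this path; the two buffers just trade roles on each recomputation.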
…alid check.

Given very wide derived function with lots of dependencies:
```julia
julia> Salsa.@derived derived_add1(rt,x) = (x+1)
derived_add1 (generic function with 1 method)

julia> @derived f(rt, a, b) = sum(derived_add1(rt, i) for i in a:b)
f (generic function with 1 method)
```

Before:
```julia
julia> let rt = Runtime()
           @time for i in 1:100
               set_x!(rt, i)  # unused, but triggers still_valid check
               f(rt, 1, 1000)
           end
       end
  0.076925 seconds (394.98 k allocations: 7.729 MiB)
```

After:
```julia
julia> let rt = Runtime()
           @time for i in 1:100
               set_x!(rt, i)  # unused, but triggers still_valid check
               f(rt, 1, 1000)
           end
       end
  0.068495 seconds (106.39 k allocations: 3.362 MiB, 12.79% gc time)
```
It's causing a StackOverflow when run with -O0 on our CI
@NHDaly NHDaly added performance CPU, RAM, disk utilization and efficiency upstream labels Jul 13, 2020
@NHDaly NHDaly merged commit 820ae69 into master Jul 13, 2020
NHDaly added a commit that referenced this pull request Jul 15, 2020
The changes in #27 changed the way we represent keys, and I forgot to
update the Inspect code to still be able to print a graph.