Upstream from RelationalAI: Salsa performance improvements: Remove allocations #27
Merged
Conversation
_Slightly_ reduces allocations for simple benchmark
Add a type assertion in get_map_for_key, where the compiler apparently can't deduce the type returned by `get!()`. This reduces allocations for a known key from 4 -> 2 and for an unknown key from 11 -> 9! 😬
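A hedged sketch of the pattern, with illustrative container and key types rather than Salsa's actual internals:

```julia
# Hypothetical sketch: `get!` with a closure can defeat inference, so a
# type assertion tells the compiler the concrete type of the result.
function get_map_for_key(caches::Dict{Symbol,Any}, key::Symbol)
    # Without the `::Dict{Int,Int}` assertion, the result is inferred as
    # `Any`, which forces boxing and extra allocations in callers.
    return get!(() -> Dict{Int,Int}(), caches, key)::Dict{Int,Int}
end

caches = Dict{Symbol,Any}()
m = get_map_for_key(caches, :derived_add1)  # inferred as Dict{Int,Int}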
Since we need to keep a count of all active derived functions (only for the assertions), we can reduce lock contention by making the counter an `Atomic` instead of taking a lock on every access.
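A minimal sketch of the pattern, with hypothetical names (not Salsa's actual internals):

```julia
# Incrementing an atomic counter avoids taking a lock on every
# derived-function call; the counter is only consulted by assertions.
const n_active_derived_functions = Threads.Atomic{Int}(0)

enter_derived!() = Threads.atomic_add!(n_active_derived_functions, 1)

function exit_derived!()
    Threads.atomic_sub!(n_active_derived_functions, 1)
    @assert n_active_derived_functions[] >= 0
end
```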
This prevents array growth events on the preallocated traces, since we're expecting them to receive a non-zero number of elements.
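For instance, Julia's `sizehint!` reserves capacity up front; the element type and capacity here are illustrative:

```julia
# Reserving capacity avoids the incremental growth reallocations that
# happen as elements are pushed one at a time.
trace = Vector{Any}()
sizehint!(trace, 16)  # 16 is an illustrative capacity, not Salsa's
for dep in 1:10
    push!(trace, dep)  # no growth events until the hint is exceeded
end
```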
Since the Salsa stack traces are implemented via a linked list (in order to support branching for calls on new threads), they necessarily allocate (2 allocations) on each `@derived function` call. HOWEVER, it's worth noting that this allocation is _extremely_ cheap, so we might consider re-enabling this, even in debug mode, since it really is nice to do early cycle detection and pretty stack-trace printing.
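For illustration, a minimal sketch of such a linked-list trace; all names here are hypothetical, not Salsa's actual implementation:

```julia
# Each node points at its parent, so a new thread can branch off from any
# point in an existing trace, and cycle detection is a walk up the parents.
struct TraceNode
    key::Symbol  # illustrative stand-in for a derived-function key
    parent::Union{Nothing,TraceNode}
end

in_trace(::Nothing, key::Symbol) = false
in_trace(node::TraceNode, key::Symbol) =
    node.key === key || in_trace(node.parent, key)

root  = TraceNode(:f, nothing)
child = TraceNode(:g, root)  # the small per-call allocation described above
@assert in_trace(child, :f) && !in_trace(child, :h)
```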
Somehow, just the presence of the type annotation with a `where T` caused allocations that disappear when it is removed. I'm not at all sure why.
Reduce allocs and time further by typing the Derived functions map with the expected, computed return type. Interestingly, this still has 7 allocations when storing a new derived value, whereas the alternative approach of switching to a single giant dict has only 3; but this is still faster.
This also improves performance by reducing the time spent constructing types at runtime, which is surprisingly expensive! :) This is a simplification we've wanted to do for a while.
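A minimal sketch of the typed-map idea described above, with illustrative names rather than the actual Salsa internals: parameterize the cached value type on the derived function's return type, so lookups are concretely typed.

```julia
struct DerivedValue{T}
    value::T
    last_verified::Int
end

# One cache per derived function, concretely typed on its return type,
# instead of a single map holding abstractly typed values.
const add1_cache = Dict{Tuple{Int},DerivedValue{Int}}()

add1_cache[(1,)] = DerivedValue{Int}(2, 0)
v = add1_cache[(1,)].value  # inferred as Int, no boxing
```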
In the new get_user_function, fully qualify the dispatch key type so we're not overwriting methods. For the newly simplified key types, fix `==` and `<` to return false for keys with different function types (see the sketch below).
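A hedged sketch of the comparison fix; the struct and field names are illustrative:

```julia
# Keys for *different* derived functions must compare unequal, and `<` must
# return false rather than comparing their argument tuples.
struct DerivedKey{F}  # F is the derived function's type
    args::Tuple
end

Base.:(==)(a::DerivedKey{F1}, b::DerivedKey{F2}) where {F1,F2} =
    F1 === F2 && a.args == b.args
Base.isless(a::DerivedKey{F1}, b::DerivedKey{F2}) where {F1,F2} =
    F1 === F2 && isless(a.args, b.args)

@assert DerivedKey{typeof(sin)}((1,)) != DerivedKey{typeof(cos)}((1,))
```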
It seems to run forever in Delve trying to compute the return types of the user's code. I'm hopeful that the one at the top level should be okay, but we'll see.
I'm not entirely sure why. I think at least some of the extra allocations come from the keys always being abstract types, where in the old code they were always fully typed. And I think some of them come from `Base.ht_keyindex`, which is intermittently re-indexing the whole dict; I'm not sure why. In my simple test the runtime only ever sees one derived function, so I'd think the dicts should be the same size with and without this commit... Very interesting.

Before:
```julia
julia> let rt2 = Runtime()
           # 3 allocs per new isbits value stored
           @time for i in 1:1_000_000; derived_add1(rt2, i) end
       end
  0.892621 seconds (3.00 M allocations: 202.491 MiB)

julia> let rt2 = Runtime()
           # 4 allocs per new non-isbits value stored
           @time for i in 1:100_000; derived_length(rt2, Int[i]) end
       end
  0.149560 seconds (500.04 k allocations: 30.083 MiB)
```

After:
```julia
julia> let rt2 = Runtime()
           # Less predictable performance, and maybe 9ish allocs per value?
           @time @profile for i in 1:1_000_000; derived_add1(rt2, i) end
       end
  3.323522 seconds (9.42 M allocations: 300.486 MiB)

julia> let rt2 = Runtime()
           # Here, only 6ish per value? Less predictable, for sure.
           @time for i in 1:100_000; derived_length(rt2, Int[i]) end
       end
  0.144503 seconds (615.98 k allocations: 31.852 MiB)
```
…ore allocs" This reverts commit 113404038b016a6efa124aa37bd6d544ed75e276.
We added return_type inference to make derived functions type-stable, since Julia of course has no way to know what types of values will be found in the Salsa caches. However, this appears to never terminate (or at least, it times out after an hour on the build farm), so this commit disables it. We'll have to rely on user-specified return type annotations for now, despite the negative effect those can have on performance.
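For example, a user could annotate the return type directly on the derived function; this is a hedged sketch, assuming Salsa's `@derived` macro accepts a return-type annotation as described above:

```julia
using Salsa

# With automatic return-type inference disabled, an explicit annotation
# keeps callers of the derived function type-stable.
@derived function derived_add1(rt::Runtime, x::Int)::Int
    return x + 1
end
```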
This commit increases reuse of objects in two places to reduce allocations (see the sketch after this list):
1. Even if the value has changed, if we already have an existing_value, we can reuse it and just modify its fields rather than creating a new DerivedValue instance.
2. We can skip allocating another vector for dependency tracking by swapping the trace vector with the one in the existing_value, since we're going to reuse it anyway. That way the trace gets written directly in place, and we don't need to copy the results out and then clear the trace vector. :)
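A hedged sketch of both tricks, with illustrative names rather than Salsa's actual code:

```julia
mutable struct DerivedValue{T}
    value::T
    dependencies::Vector{Any}
    last_changed::Int
end

# Returns the old dependency vector, emptied, for reuse as the next trace
# buffer, so neither a new DerivedValue nor a new Vector is allocated.
function store_result!(existing::DerivedValue{T}, new_value::T,
                       trace::Vector{Any}, revision::Int) where {T}
    existing.value = new_value         # (1) mutate the existing instance
    existing.last_changed = revision
    # (2) swap the freshly recorded trace into the value in place:
    existing.dependencies, trace = trace, existing.dependencies
    empty!(trace)
    return trace
end
```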
…alid check. Given a very wide derived function with lots of dependencies:

```julia
julia> Salsa.@derived derived_add1(rt, x) = (x + 1)
derived_add1 (generic function with 1 method)

julia> @derived f(rt, a, b) = sum(derived_add1(rt, i) for i in a:b)
f (generic function with 1 method)
```

Before:
```julia
julia> let rt = Runtime()
           @time for i in 1:100
               set_x!(rt, i)  # unused, but triggers still_valid check
               f(rt, 1, 1000)
           end
       end
  0.076925 seconds (394.98 k allocations: 7.729 MiB)
```

After:
```julia
julia> let rt = Runtime()
           @time for i in 1:100
               set_x!(rt, i)  # unused, but triggers still_valid check
               f(rt, 1, 1000)
           end
       end
  0.068495 seconds (106.39 k allocations: 3.362 MiB, 12.79% gc time)
```
It causes a StackOverflowError when run with `-O0` on our CI.
NHDaly added a commit that referenced this pull request on Jul 15, 2020:
The changes in #27 changed the way we represent keys, and I forgot to update the Inspect code to still be able to print a graph.
ghost mentioned this pull request on Jul 15, 2020.
Upstream from RelationalAI
Use the benchmarks added in #23 to improve Salsa performance.
Ultimately, most of this work turned out to be finding and removing allocations in the DefaultStorage implementation. Some of those allocations were from type instability, which I was able to fix via some small silly things. A lot of the improvement came from implementing the TODO optimization ideas, which reduced allocations by being smarter about reusing existing instances of DerivedValue and the vectors they contain.
Here are the latest results from `master` on the Salsa benchmarks introduced in #23, and here are the results from the latest build on this branch: