Rewrite the pass that adds mutexes for atomic nodes (#8105) · halide/Halide@bf0d611

Commit

Rewrite the pass that adds mutexes for atomic nodes (#8105)

* Avoid redundant scope lookups

This pattern has been bugging me for a long time:

```
if (scope.contains(key)) {
  Foo f = scope.get(key);
}
```

This redundantly looks up the key in the scope twice. I've finally
gotten around to fixing it. I've introduced a find method that either
returns a const pointer to the value, if it exists, or null. It also
searches any containing scopes, which are held by const pointer, so the
method has to return a const pointer.

```
if (const Foo *f = scope.find(key)) {
}
```

For cases where you want to get and then mutate, I added shallow_find,
which doesn't search enclosing scopes, but returns a mutable pointer.

We were also doing redundant scope lookups in ScopedBinding. We stored
the key in the helper object, and then did a pop on that key in the
ScopedBinding destructor. This commit changes Scope so that Scope::push
returns an opaque token that you can pass to Scope::pop to have it
remove that element without doing a fresh lookup. ScopedBinding now uses
this. Under the hood it's just an iterator on the underlying map (map
iterators are not invalidated on inserting or removing other stuff).

The net effect is to speed up local laplacian lowering by about 5%

I also considered making it look more like an stl class, and having find
return an iterator, but it doesn't really work. The iterator it returns
might point to an entry in an enclosing scope, in which case you can't
compare it to the .end() method of the scope you have. Scopes are
different enough from maps that the interface really needs to be
distinct.

* Pacify clang-tidy

* Rewrite the pass that injects mutexes to support atomics

For O(n) nested allocate nodes, this pass was quadratic in n, even if
there was no use of atomics. This commit rewrites it to use a
linear-time algorithm, and skips it entirely after the first validation
pass if there aren't any atomic nodes.

It also needlessly used IRGraphMutators, which slowed things down,
didn't handle LargeBuffers (could overflow in the allocation),
incorrectly thought every producer/consumer node was associated with an
output buffer, and didn't print the realization name when printing the
atomic node (the body of an atomic node is only atomic w.r.t. a specific
realization).

I noticed all this because it stuck out in a profile. For resnet 50, the
rewrite that changed to a linear algorithm took this stage from 185ms
down to 6.7ms, and then skipping it entirely when it doesn't find any
atomic nodes added 1.5 for the single IRVisitor check.

For local laplacian with 100 pyramid levels (which contains many nested
allocate nodes due to a large number of skip connections), the times are
5846 ms -> 16 ms -> 4.6 ms

This is built on top of #8103

* Fix unintentional mutation of interval in scope

---------

Co-authored-by: Steven Johnson <[email protected]>

Loading branch information

abadams and steven-johnson authored Mar 12, 2024

1 parent 3c2d809 commit bf0d611

0 comments on commit `bf0d611`

Please sign in to comment.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Commit

There are no files selected for viewing

0 comments on commit `bf0d611`

Commit

There are no files selected for viewing

0 comments on commit bf0d611

0 comments on commit `bf0d611`