Global reductions such as `nodal_min` must compute and evaluate the local reduction before passing the result to MPI. The code for these local reductions is not compiled, so it seems they would incur the cost of generating the kernel each time the reduction is called. Should they perhaps have a memoized `compile` on the inside, similar to `compiled_lsrk45_step`, instead?
(Not sure if this has any real performance impact, I just noticed it while discussing with @MTCam and thought it was worth mentioning.)
This is a good question. There certainly is a performance impact, and so we definitely want to move away from the current approach of using freeze/thaw. That said, there is not currently code to deal with reductions as far as DAG transformation is concerned; there are two specific pieces missing:
- We likely want to expose distributed-memory (effectively MPI) reduce/allreduce as a DAG node. We could roll our own using point-to-point and a tree to avoid the need for a new node type, but I think that's not a good idea.
- Single-GPU global reductions need a transform path. Loopy can do those transformations, we just need to make sure they happen.
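For a sense of what "rolling our own" with point-to-point exchanges would involve, the sketch below simulates an allreduce-min in pure Python. It uses recursive doubling, a butterfly-shaped variant of the tree idea, and makes several assumptions: the rank count is a power of two, ranks are modeled as list indices, and each "exchange" is a list lookup rather than a real MPI send/receive. The function name is hypothetical.

```python
def simulated_allreduce_min(local_values):
    """Simulate an allreduce-min over len(local_values) ranks.

    In each round, rank r combines its value with that of partner
    r XOR distance, which stands in for one point-to-point exchange.
    After log2(nranks) rounds every rank holds the global minimum.
    """
    nranks = len(local_values)
    if nranks & (nranks - 1):
        raise ValueError("this sketch assumes a power-of-two rank count")

    values = list(local_values)
    rounds = 0
    distance = 1
    while distance < nranks:
        # All ranks exchange simultaneously, so build the new list from
        # the old one rather than updating in place.
        values = [min(values[r], values[r ^ distance]) for r in range(nranks)]
        distance *= 2
        rounds += 1
    return values, rounds


result, rounds = simulated_allreduce_min([4.0, 2.5, 7.0, 3.0])
```

With four ranks, two rounds suffice and every entry of `result` is 2.5. A real implementation would also have to handle rank counts that are not powers of two and message scheduling, which is part of why a dedicated reduce/allreduce node looks more attractive.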
I don't think compiling internally is a good idea, as it would effectively cement the notion that reductions are evaluated eagerly. That would preclude incorporating reductions into larger DAGs, while I think that's actually desirable.
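As a toy illustration of that point, the sketch below treats a reduction as just another node in a small expression graph, so it can sit in the middle of a larger expression and nothing is computed until the whole graph is evaluated. The node classes (`Data`, `Min`, `Scale`) and `evaluate` are hypothetical and are not the DAG representation this project uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Data:
    values: tuple  # leaf node holding array data


@dataclass(frozen=True)
class Min:
    child: object  # reduction node: array -> scalar


@dataclass(frozen=True)
class Scale:
    child: object  # multiplies an array or a scalar by a constant
    factor: float


def evaluate(node):
    """Walk the graph once; a compiler could instead fuse or transform it."""
    if isinstance(node, Data):
        return list(node.values)
    if isinstance(node, Min):
        return min(evaluate(node.child))
    if isinstance(node, Scale):
        inner = evaluate(node.child)
        if isinstance(inner, list):
            return [node.factor * v for v in inner]
        return node.factor * inner
    raise TypeError(f"unknown node type: {type(node).__name__}")


u = Data((3.0, 1.5, 2.0))
# The reduction is an interior node: work happens before and after it.
expr = Scale(Min(Scale(u, 2.0)), 0.5)
result = evaluate(expr)
```

Here `expr` is built without evaluating anything; an eagerly compiled reduction would instead force `Min(...)` to a number at construction time, splitting the expression into two separately compiled pieces.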