Should global reductions `compile` internally? #274

majosm · 2022-07-07T19:51:26Z

Global reductions such as `nodal_min` must compute and evaluate the local reduction before passing the result to MPI. The code for these local reductions is not compiled, so it seems they would incur the cost of generating the kernel each time the reduction is called. Should they perhaps have a memoized `compile` on the inside, similar to `compiled_lsrk45_step`?
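To make the question concrete, here is a minimal pure-Python sketch of the memoization pattern being asked about. It does not use the real array context API: `generate_kernel`, `get_compiled_reduction`, and `codegen_log` are hypothetical stand-ins, and "code generation" is simulated by a dictionary lookup that records each call.

```python
from functools import lru_cache

# Records every (simulated) kernel generation so the cost can be observed.
codegen_log = []


def generate_kernel(op_name):
    """Hypothetical stand-in for expensive kernel generation."""
    codegen_log.append(op_name)
    return {"min": min, "max": max, "sum": sum}[op_name]


@lru_cache(maxsize=None)
def get_compiled_reduction(op_name):
    """Generate the kernel once per operation and reuse it afterwards."""
    return generate_kernel(op_name)


def local_reduce_uncached(op_name, data):
    # Regenerates the kernel on every call.
    return generate_kernel(op_name)(data)


def local_reduce_memoized(op_name, data):
    # Pays the generation cost only on the first call for each op_name.
    return get_compiled_reduction(op_name)(data)
```

Calling `local_reduce_memoized("min", ...)` repeatedly adds only one entry to `codegen_log`, while the uncached variant adds one per call.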

(Not sure if this has any real performance impact, I just noticed it while discussing with @MTCam and thought it was worth mentioning.)

inducer · 2022-07-10T10:33:39Z

This is a good question. There certainly is a performance impact, and so we definitely want to move away from the current approach of using freeze/thaw. That said, there is not currently code to deal with reductions as far as DAG transformation is concerned; there are two specific pieces missing:

- We likely want to expose distributed-memory (effectively MPI) reduce/allreduce as a DAG node. We could roll our own using point-to-point communication and a tree to avoid the need for a new node type, but I don't think that's a good idea.
- Single-GPU global reductions need a transform path. Loopy can do those transformations; we just need to make sure they happen.
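For readers unfamiliar with the "point-to-point and a tree" alternative mentioned above, the following is a rough single-process simulation of a binomial-tree allreduce. It is only an illustration: ranks are modelled as list slots, "sends" are plain assignments, and the name `tree_allreduce` is hypothetical. With real MPI, a single library allreduce call (for example mpi4py's `comm.allreduce(value, op=MPI.MIN)`) replaces all of this.

```python
def tree_allreduce(values, op):
    """Simulate an allreduce over len(values) ranks with a binomial tree.

    values[r] is the local value on rank r; op combines two values.
    Returns the list of per-rank results (all equal).
    """
    nranks = len(values)
    buf = list(values)

    # Reduce phase: at each level, rank r + stride "sends" to rank r.
    stride = 1
    while stride < nranks:
        for r in range(0, nranks, 2 * stride):
            partner = r + stride
            if partner < nranks:
                buf[r] = op(buf[r], buf[partner])
        stride *= 2

    # Broadcast phase: rank 0 holds the result; walk the tree back down.
    stride //= 2
    while stride >= 1:
        for r in range(0, nranks, 2 * stride):
            partner = r + stride
            if partner < nranks:
                buf[partner] = buf[r]
        stride //= 2

    return buf
```

For example, `tree_allreduce([3, 1, 4, 1, 5], min)` leaves every simulated rank holding `1`. Each phase takes about log2(nranks) levels, but expressing it this way in a DAG would mean many send/receive nodes instead of one reduction node.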

I don't think compiling internally is a good idea, as it would effectively cement the notion that reductions are evaluated eagerly. That would preclude incorporating reductions into larger DAGs, which I think is actually desirable.
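The eager-versus-lazy distinction can be illustrated with a toy expression graph. This is a sketch under heavy simplification, not the actual DAG representation used by the project: the node classes `Data`, `Scale`, and `Min` and the `evaluate` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Data:
    values: tuple  # leaf node holding array data


@dataclass(frozen=True)
class Scale:
    child: object
    factor: float  # multiply the child's result by this factor


@dataclass(frozen=True)
class Min:
    child: object  # reduction kept as an ordinary node in the graph


def evaluate(node):
    """Walk the whole graph once and produce a concrete result."""
    if isinstance(node, Data):
        return node.values
    if isinstance(node, Scale):
        result = evaluate(node.child)
        if isinstance(result, tuple):
            return tuple(node.factor * x for x in result)
        return node.factor * result
    if isinstance(node, Min):
        return min(evaluate(node.child))
    raise TypeError(f"unknown node type: {type(node).__name__}")


# The reduction sits in the middle of a larger expression; nothing is
# computed until evaluate() is called on the outermost node.
expr = Scale(Min(Scale(Data((3.0, 1.0, 2.0)), 2.0)), 0.5)
```

Here `evaluate(expr)` sees the operations before and after the reduction as one graph, which is what would allow them to be transformed together. A reduction that compiled and ran internally would instead return a concrete number at the `Min` step and split the graph in two.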
