Difference in (partitioned) generated loopy kernels on multiple runs #498
Using (perhaps) persistent hashes of DAGs and/or Loopy kernels, what is the first point at which the IRs diverge?
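As an illustration of this suggestion, here is a minimal sketch of stage-by-stage fingerprinting; the helper names, the stage labels, and the use of str() as a stable rendering are assumptions, and the stand-in objects would in practice be the pytato DAG, the partitioned DAG, and the generated loopy translation unit:

```python
import hashlib


def ir_fingerprint(obj) -> str:
    """Hash a textual rendering of an IR object (DAG, partition, or kernel).

    Assumes str(obj) is deterministic enough that two runs producing the
    same IR also produce the same digest.
    """
    return hashlib.sha256(str(obj).encode("utf-8")).hexdigest()


def report_stage_hashes(stages: dict) -> None:
    """Print one fingerprint per pipeline stage; diffing this output from
    two runs shows the first stage at which the IRs diverge."""
    for name, obj in stages.items():
        print(f"{name}: {ir_fingerprint(obj)}")


if __name__ == "__main__":
    # Stand-ins so the sketch runs on its own; replace these with the actual
    # pytato DAG, partitioned DAG, and loopy translation unit.
    report_stage_hashes({
        "dag": "placeholder-dag",
        "partitioned_dag": "placeholder-partition",
        "loopy_kernel": "placeholder-kernel",
    })
```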
Small update: With #459, #465, and #505, rhs kernels differ already in the (unpartitioned) input. Run 1:
Run 2:
Note that in this example, ranks 0 and 3 have different hashes between runs, but ranks 1 and 2 match. https://gist.github.com/matthiasdiener/96066e1e61125fe0e6c8b9e9514a69a8 has more investigation regarding this test.
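To compare per-rank hashes like these across runs, one possible approach (a sketch only; the function names are illustrative and not from the original report) is to gather each rank's fingerprint on rank 0 with mpi4py, so that the output of two runs can be diffed line by line:

```python
import hashlib

from mpi4py import MPI


def rank_fingerprint(obj) -> str:
    """Hash the textual rendering of this rank's generated kernel."""
    return hashlib.sha256(str(obj).encode("utf-8")).hexdigest()


def gather_and_report(obj) -> None:
    """Collect every rank's fingerprint on rank 0 and print them in rank
    order, so two runs' outputs can be compared directly."""
    comm = MPI.COMM_WORLD
    digests = comm.gather(rank_fingerprint(obj), root=0)
    if comm.rank == 0:
        for rank, digest in enumerate(digests):
            print(f"rank {rank}: {digest}")


if __name__ == "__main__":
    # Stand-in per-rank object; in the actual run this would be the
    # rank-local rhs kernel produced by the lazy pipeline.
    gather_and_report(f"placeholder-kernel-on-rank-{MPI.COMM_WORLD.rank}")
```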
Even with #465, generated loopy kernels often differ across different runs of the same multi-rank application.
Consider
mpirun -n 4 python -m mpi4py mirgecom/examples/wave.py --lazy
which produces partially different kernels on most runs. As an example, one of the post-partition rhs kernels for rank 1 is:
On a subsequent run, the same kernel is generated with the following diff:
Note that after linearization, both generated kernels are 100% identical, but they differ in arguments, domains and instructions.
I am not sure at the moment how to debug this further.
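One possible way to narrow this down, sketched here purely as an illustration (the helper is hypothetical; it assumes the kernel exposes the usual loopy attributes args, domains, and instructions), is to fingerprint each field separately, both in original and in sorted order, so that a cross-run comparison shows which field changed and whether the change is merely a reordering:

```python
import hashlib
from collections import namedtuple


def field_fingerprints(knl) -> dict:
    """Hash a kernel's args, domains, and instructions separately, both in
    original and in sorted order; comparing two runs then shows which field
    differs and whether the difference is only an ordering change."""
    out = {}
    for name in ("args", "domains", "instructions"):
        items = [str(x) for x in getattr(knl, name)]
        out[name] = hashlib.sha256("\n".join(items).encode()).hexdigest()
        out[f"{name} (sorted)"] = hashlib.sha256(
            "\n".join(sorted(items)).encode()).hexdigest()
    return out


if __name__ == "__main__":
    # Stand-in with the same attribute names as a loopy kernel; in the real
    # pipeline one would pass e.g. the translation unit's entrypoint kernel.
    Fake = namedtuple("Fake", ["args", "domains", "instructions"])
    knl = Fake(args=["a", "b"], domains=["{ [i] : 0 <= i < n }"],
               instructions=["out[i] = a[i] + b[i]"])
    for key, digest in field_fingerprints(knl).items():
        print(f"{key}: {digest}")
```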
The kernel differences appear even with PYTHONHASHSEED set (i.e., this is not an issue of strings being stored in sets). It is possible that this issue is caused by an earlier stage in the pipeline (e.g., in meshmode), but it seems to be restricted to multi-rank runs.
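Since PYTHONHASHSEED only takes effect if it is present in each rank's environment before the interpreter starts, a quick diagnostic (a sketch added here, not part of the original report) can confirm that the variable actually reached every rank and that string hashing is reproducible across runs:

```python
import os

from mpi4py import MPI


def check_hash_seed() -> None:
    """Report, per rank, whether PYTHONHASHSEED is set in this process and
    what a fixed string hashes to; if hash randomization were still active,
    the probe value would change from one run to the next."""
    comm = MPI.COMM_WORLD
    seed = os.environ.get("PYTHONHASHSEED", "<unset>")
    probe = hash("probe-string")
    info = comm.gather((seed, probe), root=0)
    if comm.rank == 0:
        for rank, (s, h) in enumerate(info):
            print(f"rank {rank}: PYTHONHASHSEED={s}, hash('probe-string')={h}")


if __name__ == "__main__":
    check_hash_seed()
```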