Adding ESPResSo test PR #144
Conversation
Currently this PR includes files such as the job script and the benchmarking file, which will be removed later. Right now they are here for reference. [...] sub-dirs as part of the package. Sanity for the weak scaling is already added. The next step is to extract the performance timing and log it.
print("Executing sanity checks...\n") | ||
if (np.all([np.allclose(energy, ref_energy, atol=atol_energy, rtol=rtol_energy), | ||
np.allclose(p_scalar, np.trace(ref_pressure) / 3., | ||
atol=atol_pressure, rtol=rtol_pressure), | ||
np.allclose(p_tensor, ref_pressure, atol=atol_pressure, rtol=rtol_pressure), | ||
np.allclose(forces, 0., atol=atol_forces, rtol=rtol_forces), | ||
np.allclose(np.median(np.abs(forces)), 0., atol=atol_abs_forces, rtol=rtol_abs_forces)])): | ||
print("Final convergence met with tolerances: \n\ | ||
energy: ", atol_energy, "\n\ | ||
p_scalar: ", atol_pressure, "\n\ | ||
p_tensor: ", atol_pressure, "\n\ | ||
forces: ", atol_forces, "\n\ | ||
abs_forces: ", atol_abs_forces, "\n") | ||
else: | ||
print("At least one parameter did not meet the tolerance, see the log above.\n") |
The sanity checks have actually already executed at lines 111 to 116, and they will interrupt the Python interpreter with an exception if any check fails, so I would suspect the else branch is unreachable. I would also recommend against re-expressing the assertions as np.allclose in the conditional, to avoid redundancy and to prevent the risk that the assertions and the conditional diverge over time, for example due to changes to tolerance values.
Hi @jngrad,
I have used the same values for the tolerances in both the assertions and the conditional, so the values should not diverge between them.
Do these assertions also end the Python execution? In that case, I will move the conditional above your original assertions so that execution can also reach the else branch of the code.
I have used the same values for the tolerances in both the assertions and the conditional, so the values should not diverge between them.
They don't diverge today. But they might in a month's time if multiple people contribute to this file, or if you forget that the conditional block must exactly mirror the assertion block above it.
Do these assertions also end the Python execution?
np.testing.assert_allclose() raises an AssertionError, which halts the Python interpreter with a non-zero exit code.
In that case, I will move the conditional above your original assertions so that execution can also reach the else branch of the code.
Is the else branch truly needed? np.testing.assert_allclose() already generates a clear error message:
Traceback (most recent call last):
File "/work/jgrad/espresso/src/madelung.py", line 116, in <module>
np.testing.assert_allclose(np.median(np.abs(forces)), 0., atol=atol_abs_forces, rtol=rtol_abs_forces)
File "/tikhome/jgrad/.local/lib/python3.10/site-packages/numpy/testing/_private/utils.py", line 1527, in assert_allclose
assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
File "/tikhome/jgrad/.local/lib/python3.10/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0, atol=2e-06
Mismatched elements: 1 / 1 (100%)
Max absolute difference: 2.1e-06
Max relative difference: inf
x: array(2.1e-06)
y: array(0.)
I do not totally understand the purpose of this if/else statement. If you need to log the tolerances to stdout, you can do so independently of the success or failure of the assertions, since they are constants. If you need to report the success or failure of the assertions, that is already done by numpy.
Yeah, the assertion reports a failure in the manner you have pointed out, but how can I report a success? Or would printing a message right below it suffice, since the program exits anyway on failure?
You can print("Success") or check if Python returned exit code 0.
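A minimal sketch of that suggestion (not the PR's actual code), reusing the names from the snippet above (energy, ref_energy, p_scalar, p_tensor, forces and the tolerance constants are assumed to be defined earlier in madelung.py): log the tolerances unconditionally, let the assertions abort on failure, and print a success marker that is only reached when every check passes.

import numpy as np

# Log the tolerances regardless of the outcome; they are constants.
print("Tolerances: energy=", atol_energy, " pressure=", atol_pressure,
      " forces=", atol_forces, " abs_forces=", atol_abs_forces)

# Each call raises AssertionError (non-zero exit code) on failure.
np.testing.assert_allclose(energy, ref_energy, atol=atol_energy, rtol=rtol_energy)
np.testing.assert_allclose(p_scalar, np.trace(ref_pressure) / 3.,
                           atol=atol_pressure, rtol=rtol_pressure)
np.testing.assert_allclose(p_tensor, ref_pressure, atol=atol_pressure, rtol=rtol_pressure)
np.testing.assert_allclose(forces, 0., atol=atol_forces, rtol=rtol_forces)
np.testing.assert_allclose(np.median(np.abs(forces)), 0.,
                           atol=atol_abs_forces, rtol=rtol_abs_forces)

# Only reached if every assertion above passed.
print("Success: all sanity checks passed.")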
if pathlib.Path(args.output).is_file():
    header = ""
with open(args.output, "a") as f:
    f.write(header + report)
This write operation is superfluous if the ReFrame runner captures stdout.
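In other words (a hedged sketch, assuming report is the string assembled earlier in the script), the block above could reduce to printing the report and letting the ReFrame runner capture it from stdout:

# Hypothetical replacement for the file append above: emit the report on stdout
# so the ReFrame runner can capture and parse it.
print(report)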
Yes, I have to remove this; it is part of my TODO. :)
I guess the [...]
First run on hortense:
I now ran it on another partition with some more memory: all the tests that do not request one or multiple nodes passed. The following scales [...] Now waiting on the [...] scale.
Yes, it was a mistake which I intend to correct. :)
@run_after('init')
def set_mem(self):
    """ Setting an extra job option of memory. """
    self.extra_resources = {'memory': {'size': '50GB'}}
Please use req_memory_per_node() from test-suite/eessi/testsuite/hooks.py (line 386 in b0c91e4):

def req_memory_per_node(test: rfm.RegressionTest, app_mem_req):
Also, I assume the memory requirement isn't a fixed 50GB, but depends on the scale at which this is run (i.e. number of tasks)? Or doesn't it? If it does, please define an approximate function to compute the memory requirement as a function of task count. It's fine if it is somewhat conservative (i.e. asks for too much), but be aware that the test will be skipped on systems where insufficient memory is available (so don't over-do it).
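As an illustration only (not the hook actually added in this PR): a rough per-node memory estimate as a function of the task count, passed to hooks.req_memory_per_node(). The class name, the base and per-task constants, the MiB unit assumed for app_mem_req, and the assumption that num_tasks_per_node is already set by the usual EESSI hooks at setup time are all placeholders to be checked against the hook's docstring.

import reframe as rfm
from eessi.testsuite import hooks

class EESSI_ESPRESSO_memory_sketch(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'true'

    @run_after('setup')
    def request_memory(self):
        """Estimate per-node memory from the task count instead of a fixed 50GB."""
        base_mem = 2048       # placeholder: fixed per-node overhead, in MiB
        mem_per_task = 900    # placeholder: rough per-task requirement, in MiB
        mem_required = base_mem + mem_per_task * self.num_tasks_per_node
        hooks.req_memory_per_node(self, app_mem_req=mem_required)

Being somewhat conservative here is fine, as noted above, but an overly large estimate will cause the test to be skipped on smaller systems.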
This was the problem causing the OOM and I have fixed it. I have yet to push it to the PR as I am making some more incremental changes. I am also observing some crashes on zen4, such as:
https://gitlab.com/eessi/support/-/issues/37#note_1927317164
2. Increased time limit to account for tuning that takes longer on a large number of cores.
Temporary fix until the mesh size can be fixed based on extrapolation.
[...] madelung.py for the linter.
@laraPPr can you test this PR on your local system again? The 16-node test takes way too long to run for this PR. I had a discussion with @jngrad and we have a strategy to overcome this limitation, but that will be part of a subsequent PR to this test. Other than that, the remaining tests should be manageable. I have also re-requested a review for this PR from @casparvl and @jngrad.
Latest result on Snellius:
Everything except the 16_node test (which I cancelled) passes with this fix: #151
Just for logging this here: I encountered [...] on Vega. That's why Satish made the change in ef21ed5 to 1) use the [...]. I've just resubmitted the tests, and I can confirm that I no longer get the [...].
dialing down the scales within the CI tests since they should not take too much time.
Do we need to include eessi/testsuite/tests/apps/espresso/src/job.sh? I don't see it being used at all by the test...
Maybe it's better to have a README file with some info on how to run this manually (using EESSI, not Spack ;) )
Removing the statement from madelung.py that puts benchmark.csv as the path within the output parameter.
Ok, retested one final time. The bigger tests haven't completed yet, but the tests that have finished look good. Good enough for me.
ESPResSo P3M weak-scaling and strong-scaling test.