ChannelInvalidStateError: avoid excepting the job when this happens #5031
Comments
Thanks for the report @rubel75! Are you running the simulations on the same computer as AiiDA? The direct scheduler does not really schedule anything; it is just a quick workaround to avoid having to install a proper scheduler. Also, ChannelInvalidStateError errors are errors in the connection to RabbitMQ, so this is not really a transport problem. I think (but I'm not sure) that it's caused by the high workload on the computer. So I think there are two issues:
|
Thank you @giovannipizzi for the suggestions. I tried running via Slurm, but the problem persists. I also tried increasing the number of AiiDA daemon workers (2 -> 6) to match the number of concurrent processes. Also, I used Oleg
|
The issue is related to RMQ
The fix is not perfect, but it works for my jobs under 10 hrs. I am closing this issue since Chris opened a separate issue here: #5105 |
To our knowledge, is there any way to figure out whether the correct I've updated the instructions in the wiki; they include a way to check that the configuration file is seen and read by RabbitMQ, but what I'm missing is a way to tell whether the setting is being applied to the queue relevant to AiiDA (e.g. if it was created before the global setting was changed). I didn't find anything in the RabbitMQ documentation. I tried looking inside the .DCD files inside the |
I encountered a ChannelInvalidStateError that was previously described in https://groups.google.com/g/aiidausers/c/O9Z47vTnji4. The full error message is listed at the end of this post. The problem occurs after the main WIEN2k calculation step, run_lapw, finishes (see plugin details at https://github.com/rubel75/aiida-wien2k). It shows up infrequently for small calculations (~5% of cases), but for a larger case (more k-points) it happens in 50% of cases. These calculations belong to the artificial oxides group. The calculations are run in direct mode (only 4 of the 16 available cores are used). The AiiDA daemon is configured with 2 workers.
Any ideas are welcome, including those on how to better debug this error.
Thank you in advance
Oleg
Steps to reproduce
The challenge is that a modified WIEN2k version is needed (currently not available to the public).
Otherwise steps are:
git clone https://github.com/rubel75/aiida-wien2k
install aiida-wien2k with pip install -e .
aiida-wien2k/aiida_wien2k/configs/...
cd aiida-wien2k/etc/
verdi run launch_oxides_scf_workchain.py
Expected behavior
You will see Wien2kRunLapw jobs Excepted, and new processes are Created but not running (same as in https://groups.google.com/g/aiidausers/c/O9Z47vTnji4). At the same time, verdi status is all good.
Environment
Additional context
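As a debugging aid for quantifying how often the jobs except (the ~5% vs 50% figures above), a minimal sketch could tally process states per process label. It assumes the (label, state) pairs have already been collected, for example copied by hand from the output of verdi process list -a. The helper names summarize_states and excepted_fraction are hypothetical and not part of AiiDA or aiida-wien2k.

```python
from collections import Counter


def summarize_states(records):
    """Tally process states per process label.

    `records` is an iterable of (process_label, process_state) pairs,
    e.g. ("Wien2kRunLapw", "Excepted"). How the pairs are gathered is
    left open; this is an assumption of the sketch.
    """
    by_label = {}
    for label, state in records:
        # Normalize case so "Excepted" and "excepted" count together.
        by_label.setdefault(label, Counter())[state.lower()] += 1
    return by_label


def excepted_fraction(counter):
    """Return the fraction of processes in the 'excepted' state."""
    total = sum(counter.values())
    return counter["excepted"] / total if total else 0.0
```

Comparing the fraction for small and large cases (or before and after changing the number of daemon workers) gives a concrete number to report instead of a rough estimate.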