Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ChannelInvalidStateError: avoid excepting the job when this happens #5031

Closed
rubel75 opened this issue Aug 1, 2021 · 4 comments
Closed

ChannelInvalidStateError: avoid excepting the job when this happens #5031

rubel75 opened this issue Aug 1, 2021 · 4 comments
Labels

Comments

@rubel75
Copy link

rubel75 commented Aug 1, 2021

I encountered a ChannelInvalidStateError that was previously described in https://groups.google.com/g/aiidausers/c/O9Z47vTnji4
The full error message is listed at the end of this post. The problem occurs after the main WIEN2k calculation step finishes run_lapw (see plugin details at https://github.com/rubel75/aiida-wien2k). It shows up for small calculations not frequently (~5% cases), but for a larger case (more k-points) it happens in 50% cases. These calculations belong to the artificial oxides group. The calculations are run in a direct mode (only 4 cores are used from 16 available). The AiiDA daemon is configured with 2 workers.

Any ideas are welcome, including those on how to better debug this error.

Thank you in advance
Oleg

Steps to reproduce

The challenge is that it a modified WIEN2k version is needed (currently not available to the public).

Otherwise steps are:

  1. Execute git clone https://github.com/rubel75/aiida-wien2k
  2. Install aiida-wien2k with pip install -e .
  3. Install codes and computer from aiida-wien2k/aiida_wien2k/configs/...
  4. cd aiida-wien2k/etc/
  5. verdi run launch_oxides_scf_workchain.py

Expected behavior

You will see Wien2kRunLapw jobs Excepted and new processes are Created, but not running (same as in https://groups.google.com/g/aiidausers/c/O9Z47vTnji4). At the same time verdi status is all good.

(aiida) rubel@psi11:~/aiida-wien2k/etc> verdi process list -a
   PK  Created    Process label       Process State    Process status
-----  ---------  ------------------  ---------------  ---------------------------------------
20109  49m ago    Wien2kScfWorkChain  ⏹ Finished [0]
20110  49m ago    Wien2kXSgroup       ⏹ Finished [0]
20115  48m ago    Wien2kInitLapw      ⏹ Finished [0]
20120  48m ago    Wien2kScfWorkChain  ⨯ Excepted
20121  48m ago    Wien2kXSgroup       ⏹ Finished [0]
20125  48m ago    Wien2kRunLapw       ⏹ Finished [0]
20130  48m ago    Wien2kScfWorkChain  ⏹ Finished [0]
20133  48m ago    Wien2kXSgroup       ⏹ Finished [0]
20134  48m ago    Wien2kInitLapw      ⏹ Finished [0]
20141  48m ago    Wien2kScfWorkChain  ⏵ Waiting        Waiting for child processes: 20159
20142  48m ago    Wien2kXSgroup       ⏹ Finished [0]
20150  47m ago    Wien2kInitLapw      ⏹ Finished [0]
20151  47m ago    Wien2kRunLapw       ⨯ Excepted
20152  47m ago    Wien2kInitLapw      ⏹ Finished [0]
20158  47m ago    Wien2kRunLapw       ⏹ Finished [0]
20159  47m ago    Wien2kRunLapw       ⏵ Waiting        Monitoring scheduler: job state RUNNING
20167  22m ago    Wien2kScfWorkChain  ⨯ Excepted
20168  22m ago    Wien2kXSgroup       ⏹ Finished [0]
20173  22m ago    Wien2kInitLapw      ⏹ Finished [0]
20176  21m ago    Wien2kRunLapw       ⨯ Excepted
20183  18m ago    Wien2kScfWorkChain  ⨯ Excepted
20184  18m ago    Wien2kXSgroup       ⏹ Finished [0]
20189  18m ago    Wien2kInitLapw      ⨯ Excepted
20195  16m ago    Wien2kScfWorkChain  ⏹ Created
20199  7m ago     Wien2kScfWorkChain  ⏹ Created
20203  7m ago     Wien2kScfWorkChain  ⏹ Created

Environment

  • AiiDA version 1.6.4
  • Linux
  • Python 3.8.10
  • postgres (PostgreSQL) 13.3
  • rabbitmqctl version 3.8.19

Additional context

+-> ERROR at 2021-08-01 01:44:08.170868+00:00
 | Traceback (most recent call last):
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aiida/manage/external/rmq.py", line 208, in _continue
 |     result = await super()._continue(communicator, pid, nowait, tag)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/plumpy/process_comms.py", line 613, in _continue
 |     await proc.step_until_terminated()
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/plumpy/processes.py", line 1230, in step_until_terminated
 |     await self.step()
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/plumpy/processes.py", line 1216, in step
 |     self.transition_to(next_state)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 335, in transition_to
 |     self.transition_failed(initial_state_label, label, *sys.exc_info()[1:])
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 351, in transition_failed
 |     raise exception.with_traceback(trace)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 320, in transition_to
 |     self._enter_next_state(new_state)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 386, in _enter_next_state
 |     self._fire_state_event(StateEventHook.ENTERED_STATE, last_state)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 299, in _fire_state_event
 |     callback(self, hook, state)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/plumpy/processes.py", line 326, in <lambda>
 |     lambda _s, _h, from_state: self.on_entered(cast(Optional[process_states.State], from_state)),
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aiida/engine/processes/process.py", line 390, in on_entered
 |     super().on_entered(from_state)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/plumpy/processes.py", line 700, in on_entered
 |     self._communicator.broadcast_send(body=None, sender=self.pid, subject=subject)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/plumpy/communications.py", line 175, in broadcast_send
 |     return self._communicator.broadcast_send(body, sender, subject, correlation_id)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/kiwipy/rmq/threadcomms.py", line 227, in broadcast_send
 |     result = self._loop_scheduler.await_(
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/pytray/aiothreads.py", line 159, in await_
 |     return self.await_submit(awaitable).result(timeout=self.task_timeout)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/concurrent/futures/_base.py", line 444, in result
 |     return self.__get_result()
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
 |     raise self._exception
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/pytray/aiothreads.py", line 36, in done
 |     result = done_future.result()
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/asyncio/futures.py", line 178, in result
 |     raise self._exception
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/asyncio/tasks.py", line 280, in __step
 |     result = coro.send(None)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/pytray/aiothreads.py", line 178, in proxy
 |     return await awaitable
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/kiwipy/rmq/communicator.py", line 488, in broadcast_send
 |     result = await publisher.broadcast_send(body, sender, subject, correlation_id)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/kiwipy/rmq/communicator.py", line 66, in broadcast_send
 |     return await self.publish(message, routing_key=defaults.BROADCAST_TOPIC, mandatory=False)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/kiwipy/rmq/messages.py", line 209, in publish
 |     result = await self._exchange.publish(message, routing_key=routing_key, mandatory=mandatory)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aio_pika/exchange.py", line 233, in publish
 |     return await asyncio.wait_for(
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/asyncio/tasks.py", line 455, in wait_for
 |     return await fut
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aiormq/channel.py", line 508, in basic_publish
 |     async with self.lock:
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aiormq/channel.py", line 90, in lock
 |     raise ChannelInvalidStateError("%r closed" % self)
 | aiormq.exceptions.ChannelInvalidStateError: <Channel: "4"> closed
@giovannipizzi
Copy link
Member

Thanks for the report @rubel75 !

Are you running the simulations on the same computer as AiiDA? The direct scheduler is not really scheduling, but just a quick workaround to avoid having to install a proper scheduler.
But it is not suggested to use it for production runs, as all jobs will run at the same time.
You might even see low CPU usage, but this is still not a good indication: most probably your computer is busy with I/O (if you have many Wien2K files creating files all at the same time), or just requesting a lot of memory, that means most of the memory is being moved to the swap, that makes your computer a few orders of magnitude slower than usual.

Also, the ChannelInvalidStateError are errors in the connection to RabbitMQ so not really a transport problem. I think (but I'm not sure) that it's because of the high workload of the computer.

So I think there are two issues:

  1. I would keep this as an issue, and ensure that AiiDA doesn't except jobs on this exception, but just pauses them (so you can restart them when the load is lower). Pinging @chrisjsewell about this - @chrisjsewell did you already have the time to look into whether this is possible? I think this is an important issue to prioritise, since it's happening so often to @rubel75 and it's blocking for the common-workflows project (was there already another issue? Otherwise we can keep this one?)
  2. Practically, you should go via a queuing system. It's not too hard to setup, and then you can submit many more jobs via AiiDA, and let the scheduler do its job. There is no plan to implement this directly in AiiDA since this is precisely the task of of a job scheduler and we do not have the resources to invest in re-implementing a scheduler, when there are already so many good ones (like SLURM). We can follow up in private if you need pointers on how to set one up (if you understand ansible syntax, this Quantum Mobile role can help, or there are many guides on how to setup SLURM online).

@giovannipizzi giovannipizzi changed the title ChannelInvalidStateError: Transport problem? ChannelInvalidStateError: avoid excepting the job when this happens Aug 3, 2021
@rubel75
Copy link
Author

rubel75 commented Aug 23, 2021

Thank you @giovannipizzi for the suggestions. I tried running via slurm. The problem persists. Also, I tried to increase the number of aiida daemon workers (2 -> 6) to match the number of concurrent processes. Also, I used vmstat to monitor CPU (~50%), memory (~25%), I/O load, swaping (none). I enclosed the error message below. Any help/suggestions are appreciated.

Oleg

(aiida) rubel@psi11:~/aiida-wien2k> verdi process report 21485
*** 21485: CalcJobState.STASHING
*** Scheduler output: N/A
*** Scheduler errors: N/A
*** 5 LOG MESSAGES:
+-> ERROR at 2021-08-23 01:48:21.539628+00:00
 | Traceback (most recent call last):
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aiida/manage/external/rmq.py", line 208, in _continue
 |     result = await super()._continue(communicator, pid, nowait, tag)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/plumpy/process_comms.py", line 613, in _continue
 |     await proc.step_until_terminated()
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/plumpy/processes.py", line 1230, in step_until_terminated
 |     await self.step()
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/plumpy/processes.py", line 1216, in step
 |     self.transition_to(next_state)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 335, in transition_to
 |     self.transition_failed(initial_state_label, label, *sys.exc_info()[1:])
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 351, in transition_failed
 |     raise exception.with_traceback(trace)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 320, in transition_to
 |     self._enter_next_state(new_state)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 386, in _enter_next_state
 |     self._fire_state_event(StateEventHook.ENTERED_STATE, last_state)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 299, in _fire_state_event
 |     callback(self, hook, state)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/plumpy/processes.py", line 326, in <lambda>
 |     lambda _s, _h, from_state: self.on_entered(cast(Optional[process_states.State], from_state)),
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aiida/engine/processes/process.py", line 390, in on_entered
 |     super().on_entered(from_state)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/plumpy/processes.py", line 700, in on_entered
 |     self._communicator.broadcast_send(body=None, sender=self.pid, subject=subject)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/plumpy/communications.py", line 175, in broadcast_send
 |     return self._communicator.broadcast_send(body, sender, subject, correlation_id)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/kiwipy/rmq/threadcomms.py", line 227, in broadcast_send
 |     result = self._loop_scheduler.await_(
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/pytray/aiothreads.py", line 159, in await_
 |     return self.await_submit(awaitable).result(timeout=self.task_timeout)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/concurrent/futures/_base.py", line 444, in result
 |     return self.__get_result()
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
 |     raise self._exception
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/pytray/aiothreads.py", line 36, in done
 |     result = done_future.result()
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/asyncio/futures.py", line 178, in result
 |     raise self._exception
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/asyncio/tasks.py", line 280, in __step
 |     result = coro.send(None)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/pytray/aiothreads.py", line 178, in proxy
 |     return await awaitable
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/kiwipy/rmq/communicator.py", line 488, in broadcast_send
 |     result = await publisher.broadcast_send(body, sender, subject, correlation_id)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/kiwipy/rmq/communicator.py", line 66, in broadcast_send
 |     return await self.publish(message, routing_key=defaults.BROADCAST_TOPIC, mandatory=False)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/kiwipy/rmq/messages.py", line 209, in publish
 |     result = await self._exchange.publish(message, routing_key=routing_key, mandatory=mandatory)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aio_pika/exchange.py", line 233, in publish
 |     return await asyncio.wait_for(
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/asyncio/tasks.py", line 455, in wait_for
 |     return await fut
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aiormq/channel.py", line 508, in basic_publish
 |     async with self.lock:
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aiormq/channel.py", line 90, in lock
 |     raise ChannelInvalidStateError("%r closed" % self)
 | aiormq.exceptions.ChannelInvalidStateError: <Channel: "4"> closed
+-> ERROR at 2021-08-23 01:48:27.164389+00:00
 | Traceback (most recent call last):
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aiida/engine/utils.py", line 188, in exponential_backoff_retry
 |     result = await coro()
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 194, in do_update
 |     node.set_scheduler_state(JobState.DONE)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aiida/orm/nodes/process/calculation/calcjob.py", line 403, in set_scheduler_state
 |     self.set_attribute(self.SCHEDULER_STATE_KEY, state.value)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aiida/orm/utils/mixins.py", line 176, in set_attribute
 |     raise exceptions.ModificationNotAllowed('attributes of a sealed node are immutable')
 | aiida.common.exceptions.ModificationNotAllowed: attributes of a sealed node are immutable
+-> ERROR at 2021-08-23 01:48:48.405732+00:00
 | Traceback (most recent call last):
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aiida/engine/utils.py", line 188, in exponential_backoff_retry
 |     result = await coro()
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 194, in do_update
 |     node.set_scheduler_state(JobState.DONE)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aiida/orm/nodes/process/calculation/calcjob.py", line 403, in set_scheduler_state
 |     self.set_attribute(self.SCHEDULER_STATE_KEY, state.value)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aiida/orm/utils/mixins.py", line 176, in set_attribute
 |     raise exceptions.ModificationNotAllowed('attributes of a sealed node are immutable')
 | aiida.common.exceptions.ModificationNotAllowed: attributes of a sealed node are immutable
+-> ERROR at 2021-08-23 01:49:32.056153+00:00
 | Traceback (most recent call last):
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aiida/engine/utils.py", line 188, in exponential_backoff_retry
 |     result = await coro()
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 194, in do_update
 |     node.set_scheduler_state(JobState.DONE)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aiida/orm/nodes/process/calculation/calcjob.py", line 403, in set_scheduler_state
 |     self.set_attribute(self.SCHEDULER_STATE_KEY, state.value)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aiida/orm/utils/mixins.py", line 176, in set_attribute
 |     raise exceptions.ModificationNotAllowed('attributes of a sealed node are immutable')
 | aiida.common.exceptions.ModificationNotAllowed: attributes of a sealed node are immutable
+-> ERROR at 2021-08-23 01:51:00.015328+00:00
 | Traceback (most recent call last):
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aiida/engine/utils.py", line 188, in exponential_backoff_retry
 |     result = await coro()
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 194, in do_update
 |     node.set_scheduler_state(JobState.DONE)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aiida/orm/nodes/process/calculation/calcjob.py", line 403, in set_scheduler_state
 |     self.set_attribute(self.SCHEDULER_STATE_KEY, state.value)
 |   File "/eos/rubel/anaconda3/envs/aiida/lib/python3.8/site-packages/aiida/orm/utils/mixins.py", line 176, in set_attribute
 |     raise exceptions.ModificationNotAllowed('attributes of a sealed node are immutable')
 | aiida.common.exceptions.ModificationNotAllowed: attributes of a sealed node are immutable

@rubel75
Copy link
Author

rubel75 commented Sep 2, 2021

The issue is related to RMQ consumer_timeout setting (new from version 3.8.*). Emanuele Bosoni proposed this fix on Slack:

  1. I created a file with name rabbitmq.conf that contains the string consumer_timeout = 36000000 (this sets the timeout to 10h).
  2. I placed this file in the folder where the plugin file is. You can find out where this folder is typing rabbitmq-diagnostics status and looking at the second section called "plugins". In my case this folder was /etc/rabbitmq/

The fix is not perfect, but works for my jobs under 10 hrs. I close the issue since Chris opened a separate issue here #5105

@ltalirz
Copy link
Member

ltalirz commented Mar 7, 2022

To our knowledge, is there any way to figure out whether the correct consumer_timeout is applied to a particular queue, other than submitting a >30min job?

I've updated the instructions in the wiki; they include a way to check that the configuration file is seen and read by rabbitmq but what I'm missing is a way to tell whether the setting is being applied to the queue relevant to AiiDA (e.g. if it was created before the change of the global setting).

I didn't find anything in the rabbitmq documentation. I tried looking inside the .DCD files inside the ~/.rabbitmq folder of the aiida-core docker container and didn't notice anything related to "timeout".
Do we know whether this is a setting that can be modified at the queue level?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants