Issues with new asyncio daemon (stress-test) #4595
(Since I hit CTRL+C only once last night, and restarted the daemon once less than 1 h ago, these shouldn't be connected to an action from my side.)
I have many (most?) of the calculations and workflows excepted.
I get:
As you can see, most of them failed, many without even an exit status. (Example of a 401: …)
and the corresponding failed (excepted) calculation has:
and
So it failed because an internal step excepted. Example of a …
While this seems to be quite an important bug that happens often and has significant consequences for users, the "good news" is that, from this simple analysis, it seems to be mostly generated by the same …
A final comment: there are a few more, e.g. this:
The error is different, and while the workchain has failed, there are some leftover orphan calculations still queued or running on the supercomputer:
and
@giovannipizzi thanks for such an elaborate bug report! As for the 401 and None cases, after reading the traceback, though I'm not sure what's going on here, it reminds me of a piece of code that haunts me hideously. In the traceback, the program complains when it adds the RPC subscriber to the communicator (a …). I'm not sure this relates to this issue, but it might be an inspiration for how to debug this. @muhrin is more expert in this; looking forward to his input.
Thanks @unkcpz!
Thanks for the report @giovannipizzi. The exit codes here are not really useful. All those exit codes of the … The most critical one seems to be:
Here there seems to be something wrong with the connection to RabbitMQ and so any operation that involves it will raise an exception that will cause the process to fall over. The other one:
is likely just a result of the former one. A process was waiting on a subprocess, which excepted due to the first, causing the associated future to be cancelled, and here I think we might not yet be catching the correct exceptions, either because it has been changed or because multiple ones can be thrown and we didn't expect this one.
OK - I'll try to get more information about the actual errors/exceptions. Anyway, would it be possible (at least for now, or maybe even in the mid term) to 'catch' these specific exceptions and just consider them as connection failures, so that AiiDA's retry mechanism is triggered and eventually the process is paused if the problem persists? I guess these are errors that can occur, e.g. if RMQ goes down (or if it is on a different machine and the network goes down), so we should be able to allow operations to restart once the connection is re-established, instead of just excepting the whole process.
That would be ideal, but I think this is not going to be straightforward to implement, because we cannot put this into plumpy where we catch all exceptions and retry. Some exceptions are real and need to fail the process. It won't be possible to have plumpy automatically determine which failures are transient and need to be "retried", where the last is in quotes because even that is not straightforward. The exponential backoff mechanism is something we implemented completely on top of the …
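(Purely to illustrate the pattern being referred to, here is a generic sketch of an exponential-backoff retry around an awaitable operation. It is not the actual plumpy/aiida-core implementation; all names and defaults below are made up.)

import asyncio

async def with_exponential_backoff(do_operation, transient_exceptions, max_attempts=5, initial_wait=1.0):
    """Run ``do_operation()`` and retry it when it raises a transient exception.

    Generic sketch only: the mechanism in aiida-core can then also pause the
    process once the maximum number of attempts has been exhausted.
    """
    wait = initial_wait
    for attempt in range(1, max_attempts + 1):
        try:
            return await do_operation()
        except transient_exceptions:
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do with the process
            await asyncio.sleep(wait)
            wait *= 2  # double the waiting time after every failed attempt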
Hi Gio, if there's any way you can still reproduce this then it might be worth looking at whether there could be a connection between the … In aiormq there are only a few ways that the … So, if you can reproduce it, would you be able to enable logging for …? The only other ways I can see, at the moment, that the … [1] This …
Hi Martin,
Hi @giovannipizzi, the logger setter of …
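(Since the actual snippet is not visible in this extract, below is a generic sketch of how debug logging for a specific library can be enabled from a verdi shell or a small wrapper script, using only the standard logging module. The logger name 'aiormq' is an assumption, based on the library mentioned above.)

import logging

# Assumed logger name: adjust to whichever library's logs are needed (e.g. 'aiormq').
logger = logging.getLogger('aiormq')
logger.setLevel(logging.DEBUG)

# Attach a handler so the debug messages actually end up somewhere visible.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('%(asctime)s %(name)s %(levelname)s %(message)s'))
logger.addHandler(handler)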
Just a short message to confirm that, with a "low throughput" (100 WorkChains / ~300 processes running at the same time), everything works fine (I submitted a total of ~1000 work chains).
First, to note, there is now an … Then for the … In …:

try:
    identifier = self._communicator.add_rpc_subscriber(self.message_receive, identifier=str(self.pid))
    self.add_cleanup(functools.partial(self._communicator.remove_rpc_subscriber, identifier))
except kiwipy.TimeoutError:
    self.logger.exception('Process<%s>: failed to register as an RPC subscriber', self.pid)

This means that the process does not except, but the trade-off is that it will be unreachable if trying to use … Alternatively, at a "higher level", perhaps it would be better not to ignore these exceptions but rather, in aiida-core/aiida/manage/external/rmq.py (line 206 in c07e3ef), …
If I understand correctly, this would then mean that the process will not be started, and the continue task would be punted back to RabbitMQ, to be re-broadcast again to the daemon workers.
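(To make that suggestion concrete, here is a rough sketch of the idea. This is not actual aiida-core code: the subclass is hypothetical, and the key assumption, flagged in the comment, is that letting the exception propagate leaves the task unacknowledged so RabbitMQ can redeliver it.)

import logging

import kiwipy
from aiida.manage.external.rmq import ProcessLauncher

LOGGER = logging.getLogger(__name__)

class SketchProcessLauncher(ProcessLauncher):
    """Hypothetical subclass illustrating the proposal; not actual aiida-core code."""

    async def _continue(self, communicator, pid, nowait, tag=None):
        try:
            return await super()._continue(communicator, pid, nowait, tag)
        except kiwipy.TimeoutError:
            # Communicator trouble is treated as transient: instead of falling into a
            # blanket `except Exception` that marks the node as excepted and seals it,
            # let the exception propagate. Assumption: the unhandled task then goes
            # back to RabbitMQ and can be redelivered to a healthy daemon worker later.
            LOGGER.warning('could not continue process<%s>: communicator timed out', pid)
            raise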
From #4598, with respect to:

try:
    identifier = self._communicator.add_rpc_subscriber(self.message_receive, identifier=str(self.pid))
    self.add_cleanup(functools.partial(self._communicator.remove_rpc_subscriber, identifier))
except kiwipy.TimeoutError:
    self.logger.exception('Process<%s>: failed to register as an RPC subscriber', self.pid)
try:
    identifier = self._communicator.add_broadcast_subscriber(
        self.broadcast_receive, identifier=str(self.pid)
    )
    self.add_cleanup(functools.partial(self._communicator.remove_broadcast_subscriber, identifier))
except kiwipy.TimeoutError:
    self.logger.exception('Process<%s>: failed to register as a broadcast subscriber', self.pid)

I guess there should be some thought on what happens if …
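(To illustrate the concern about ending up half-registered, one possible shape for this registration logic is sketched below, written as a method on the process like the snippet above. The method name and the roll-back policy are assumptions, not what plumpy currently does; only calls that already appear in the snippets are reused.)

import functools

import kiwipy

def _register_subscribers(self):
    """Sketch: register RPC and broadcast subscribers, rolling back on partial failure."""
    try:
        rpc_id = self._communicator.add_rpc_subscriber(self.message_receive, identifier=str(self.pid))
    except kiwipy.TimeoutError:
        self.logger.exception('Process<%s>: failed to register as an RPC subscriber', self.pid)
        return  # completely unreachable, but in a consistent state

    try:
        broadcast_id = self._communicator.add_broadcast_subscriber(
            self.broadcast_receive, identifier=str(self.pid)
        )
    except kiwipy.TimeoutError:
        self.logger.exception('Process<%s>: failed to register as a broadcast subscriber', self.pid)
        # Roll back the RPC registration so the process is not left half-subscribed.
        self._communicator.remove_rpc_subscriber(rpc_id)
        return

    self.add_cleanup(functools.partial(self._communicator.remove_rpc_subscriber, rpc_id))
    self.add_cleanup(functools.partial(self._communicator.remove_broadcast_subscriber, broadcast_id))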
Yes, if in …
Same story here: it would definitely be good to handle this more gracefully and reject the task instead of excepting the process. However, I feel that here too it would be very important to find the underlying cause of this problem, because it indicates a bigger bug elsewhere.
Thanks for the reply @sphuber.
Note this is not currently the case for aiida-core, because the node is set as excepted and sealed for every exception: aiida-core/aiida/manage/external/rmq.py, lines 211 to 214 in c07e3ef
i.e. here we should at least catch …
Indeed, if it is a "permanently broken" communicator then I guess we should just kill the daemon worker.
Indeed, but it's just so hard to debug, because it is so hard to consistently replicate.
I know, I said if we were to start doing that, which to me seems like a good idea indeed, provided that we fix the underlying problem of a potentially "permanently" broken communicator. Here we should consult with @muhrin, because recent versions of …
So, just to consolidate the known failure modes when adding the process RPC/broadcast receiver: …
Could we add an option to the communicator in kiwipy, to add a callback to …
The next part of my master plan, lol: aiidateam/plumpy#213
I encountered the error …
The 11057 is a RemoteData node. It makes no sense to me why this node would report the error. Since this issue thread has diverged into a general discussion of the related problems, do I need to create a specific issue for the …
The …
Ah~ true, my bad.
Hey guys, so what is the current wisdom on …
Thanks @ltalirz. This is my traceback:
I was trying to submit the calculation …
which was taken from the tutorial. There is no obvious error message in my case.
My environment: …
From #5031 (comment), do I understand correctly that this issue stems from #5105? I.e. that the solution is to downgrade to rabbitmq 3.7.x (or edit the config file as described in #5031 (comment))?
Edit: Just to mention that for me, on Ubuntu 20.04 with rabbitmq 3.8.2, the following runs fine:

from aiida import load_profile, orm, plugins, engine
load_profile()
#builder = orm.Code.get_from_string('pw-6.3@TheHive').get_builder()
builder = orm.Code.get_from_string('qe@localhost').get_builder()
# BaTiO3 cubic structure
alat = 4. # angstrom
cell = [[alat, 0., 0.], [0., alat, 0.], [0., 0., alat]]
s = plugins.DataFactory('structure')(cell=cell)
s.append_atom(position=(0., 0., 0.), symbols='Ba')
s.append_atom(position=(alat / 2., alat / 2., alat / 2.), symbols='Ti')
s.append_atom(position=(alat / 2., alat / 2., 0.), symbols='O')
s.append_atom(position=(alat / 2., 0., alat / 2.), symbols='O')
s.append_atom(position=(0., alat / 2., alat / 2.), symbols='O')
builder.structure = s
builder.pseudos = orm.load_group('SSSP/1.1/PBE/efficiency').get_pseudos(structure=s)
builder.parameters = plugins.DataFactory('dict')(
    dict={
        'CONTROL': {
            'calculation': 'scf',
            'restart_mode': 'from_scratch',
            'wf_collect': True,
        },
        'SYSTEM': {
            'ecutwfc': 30.,
            'ecutrho': 240.,
        },
        'ELECTRONS': {
            'conv_thr': 1.e-6,
        }
    }
)
kpoints = plugins.DataFactory('array.kpoints')()
kpoints.set_kpoints_mesh([4, 4, 4])
builder.kpoints = kpoints
builder.metadata.label = 'BaTiO3 test run'
builder.metadata.options.resources = {'num_machines': 1}
builder.metadata.options.max_wallclock_seconds = 1800
builder.metadata.options.prepend_text = "export OMP_NUM_THREADS=1;"
calc = engine.submit(builder)
print(f'created calculation with PK={calc.pk}')
I tried downgrading to rabbitmq-server=3.7.28 (installed via conda) but the error message was the same. The error occurs very early in the submission. Edit: the kiwipy test suite runs fine.
There is an open issue of …
It is great news that you can reproduce this issue with such a small process submission. I'd propose you keep the environment untouched for a while. Pinging @chrisjsewell @sphuber to have a look at this.
Thanks for looking into this! I restarted my verdi shell and the problem seems to have gone away.
P.S. Ryota mentions this was with rabbitmq installed via conda |
I have now changed to using aiida-hyperqueue to submit my calculations, which mitigates the stress issue of some jobs staying in the SLURM queue for too long (FYI, I use the rabbitmq v3.9.3 installed by conda, which has the problem #5031, and I changed the timeout to 10 h). I then never see the … So for this …
Thanks for the hint @unkcpz - hyperqueue might be a good idea for some use cases, but I still think the issue needs to be addressed at the aiida-core level as well. @chrisjsewell @sphuber could one of you please comment on the current status of this issue, including whether the current … If you are still looking for user input on this issue, could you provide instructions for users, plus which log messages you are looking for? Thanks!
As far as I know, there were no changes in …
The … There is a fix ITISFoundation/osparc-simcore#2780 for the …
Thanks @unkcpz for looking into this! Indeed, after reading through mosquito/aio-pika#288 it seems possible that aio-pika 7.0 from February 2022 addressed the issue (although this is not yet confirmed). As far as I can see, osparc-simcore is still using aio-pika v6.8.0, i.e. perhaps their fix/workaround could be avoided by upgrading. kiwipy actually does not lock down the aio-pika version (https://github.com/aiidateam/kiwipy/blob/adf373e794ed69d5ec21d4875514971f32d7734f/setup.py#L42), but plumpy does, and aiida-core as well (line 29 in bc73d61).
I'll start by opening a PR against plumpy to see whether the upgrade works. Locking down the aio-pika version at the aiida-core level actually doesn't make a lot of sense to me - the only direct use of … @sphuber does that make sense? Edit: there seem to be some API changes that affect plumpy (aiidateam/plumpy#238). I haven't looked into it yet.
Yeah, that makes sense to me. Let's update in …
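(For illustration only, the kind of dependency-specifier change being discussed, in whichever package directly imports aio-pika, might look like the sketch below; the exact bounds are placeholders, not a recommendation.)

# setup.py / install_requires sketch: widen the pin so that the 7.x line, which may
# contain the relevant fixes, can be picked up; the bounds below are illustrative only.
install_requires = [
    'aio-pika>=6.6,<8',
]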
Super interesting results of the attempt: I tried to bump the version of …
Update: support for the newer …
I have tried a stress-test of the new AiiDA daemon after the replacement of tornado with asyncio.
I have updated to the most recent develop (716a1d8), also updated aiida-qe to develop (commit 1a9713aefbcd235c20ecfb65d0df226b5544bf7d of that repo), pip installed both, ran reentry scan, and stopped+started the daemon. I'll try to roughly describe what I've been doing.
Then, roughly, I have prepared a script to submit something of the order of ~2000 relax workflows.
While the submission was happening, I quickly reached the number of slots (warning message at the end of verdi process list indicating a % > 100%), so I did
verdi daemon incr 7
to work with 8 workers. After having submitted more than half of the workflows, I stopped, because 8 workers weren't enough anyway and I didn't want to overload the supercomputer with too many connections from too many workers.
I left it running overnight; the next morning I was in a stalled situation (all slots were taken), so I increased the number of workers a bit more, after a while submitted the rest of the workflows, and let them finish.
Since I realised that most were excepting (see below), I also stopped the daemon (that took a bit), made sure it was stopped, and started it again with just one worker to finish the work.
Unfortunately, I have seen a number of issues ( :-( ) where most calculations had some kind of problem. Pinging @sphuber @unkcpz @muhrin, as they have been working on this and should be able to help debug/fix the bugs.
I am going to report below, as separate comments, some of the issues that I'm seeing, but I'm not sure how to debug further, so if you need specific logs please let me know what to run (or @sphuber, I can give you temporary access to the machine if it's easier).
While I write I have the last few (~30) jobs finishing, but I can already start reporting the issues I see.
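(For context, the kind of submission script mentioned above typically boils down to a loop like the sketch below. This is illustrative only, not the actual script used for this test: the workflow entry point and the helper populating the builder are assumptions.)

import time

from aiida import engine, load_profile, plugins

load_profile()

structures = []  # StructureData nodes prepared beforehand (~2000 in the actual test)

def build_relax_builder(structure):
    """Hypothetical helper: return a populated builder for one relax workflow."""
    builder = plugins.WorkflowFactory('quantumespresso.pw.relax').get_builder()
    builder.structure = structure
    # ... code, pseudos, parameters and scheduler options would be set here ...
    return builder

for index, structure in enumerate(structures):
    node = engine.submit(build_relax_builder(structure))
    print(f'submitted {index}: PK={node.pk}')
    time.sleep(0.1)  # crude throttle so the daemon and broker are not flooded all at once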