AiiDA will no longer work with rabbitmq>3.7 by default #5105

Closed
chrisjsewell opened this issue Aug 30, 2021 · 20 comments
Labels
priority/critical-blocking must be resolved before next release topic/rabbitmq type/bug type/wontfix apply only to closed issues

Comments

@chrisjsewell
Member

chrisjsewell commented Aug 30, 2021

In rabbitmq/rabbitmq-server#2990 a consumer_timeout was introduced with a default of 15 minutes, meaning that any process task that takes longer than 15 minutes will be cancelled 😬
(there are people in that PR none too happy that this was introduced in a minor version)

The quick fix for this for users is either (a) use rabbitmq 3.7 or lower, or (b) configure consumer_timeout to false. (see also https://www.rabbitmq.com/consumers.html#acknowledgement-timeout)

As the last comment in that PR asks (at the time of writing), it is unclear to me offhand whether this can be done through the API (i.e. something aiida-core could handle automatically).

@chrisjsewell
Member Author

I feel maybe we can put this in the broker_parameters:

Two questions:

  1. will rabbitmq<3.8 complain if passed a parameter that it does not know?
  2. Can we actually set the default to false? The documentation implies it has to be an integer, but in the PR they specifically mention false: see "Set a default for consumer_timeout to 15 minutes", rabbitmq/rabbitmq-server#2990 (comment)

thoughts @sphuber?

@chrisjsewell
Member Author

trying it out in #5106

@sphuber
Contributor

sphuber commented Aug 30, 2021

I remember looking into the default timeouts a long time ago, and I think it is not a value that can be configured from the client. It has to be configured on the server itself. There was even a hardcoded maximum that could not be exceeded, so a larger value in the config would be capped at that maximum. This may have applied only to older versions of RabbitMQ (around 3.5), and I am not sure whether it is still there. Their reasoning is that the main use case for RabbitMQ is "quick" jobs on the order of seconds.

@chrisjsewell
Member Author

chrisjsewell commented Aug 30, 2021

yeh cheers, #5106 does not appear to break rabbitmq, but obviously no idea yet if it is actually having any effect

@chrisjsewell
Member Author

Hmm, yeh no joy yet; trying to set consumer_timeout to 1 in #5106, but that doesn't seem to fail anything

@chrisjsewell
Member Author

chrisjsewell commented Aug 30, 2021

Yeh no I guess it is not part of https://www.rabbitmq.com/uri-query-parameters.html#tls 😒

I asked about adding it: rabbitmq/rabbitmq-server#2990 (comment), or maybe I should open an actual issue if they don't respond

@chrisjsewell
Member Author

Ok opened: rabbitmq/rabbitmq-server#3344 🤞

@chrisjsewell
Member Author

chrisjsewell commented Sep 6, 2021

> Ok opened: rabbitmq/rabbitmq-server#3344 🤞

Well that was a dead end (we kinda use rabbitmq in a way it is not designed for)

So why don't we just remove it entirely 😉 chrisjsewell/aiida-process-coordinator#4

@giovannipizzi
Member

I just had the same issue - Channel closed error for something running > 30 minutes. I checked and indeed I have rabbitmq 3.8.16.
We'll probably need to focus on replacing rmq as soon as 2.0 is out...
However, I'm sure many people will have this error in 2.0 as now recent versions are >3.7.

Can we make this requirement more obvious?
E.g. check in verdi status and print an error that the version of RMQ is not supported and one has to downgrade, at least for the time being?
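
A minimal sketch of the kind of version check suggested here. This is not AiiDA's actual implementation: the function names are hypothetical, and the cutoff at 3.7 is an assumption taken from this issue's title.

```python
import re

# Assumption from this thread: versions up to and including 3.7 are safe,
# newer ones may enforce a default consumer_timeout.
MAX_SUPPORTED = (3, 7)


def parse_major_minor(version: str) -> tuple:
    """Extract (major, minor) from a version string such as '3.8.16' or 'v3.12.1'."""
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"cannot parse RabbitMQ version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_supported_rabbitmq(version: str) -> bool:
    """Return True if the (major, minor) version is at most MAX_SUPPORTED."""
    return parse_major_minor(version) <= MAX_SUPPORTED


# Versions mentioned in this thread.
for candidate in ("3.7.28", "3.8.16", "3.12.1"):
    if is_supported_rabbitmq(candidate):
        print(f"RabbitMQ v{candidate}: ok")
    else:
        print(f"Warning: RabbitMQ v{candidate} is not supported")
```

Comparing integer tuples rather than strings matters here, since a plain string comparison would rank "3.12" below "3.7".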

@chrisjsewell
Member Author

Adding link to another project encountering the same issue: celery/celery#6760

@tsthakur

After accidentally getting my rabbitmq updated to 3.9.x, I also faced this same issue. I would like to point out that the simplest way to downgrade rabbitmq is to use conda instead of the Debian package. Otherwise one needs to manually downgrade all dependencies, like erlang, which has its own dependencies, and it creates a big mess.

So for anyone stumbling here, running the following is all that's required.

conda install -c conda-forge rabbitmq-server=3.7.28

Maybe @giovannipizzi @chrisjsewell we can add this in the wiki where you discuss this issue?

@chrisjsewell
Member Author

yeh, as we have just been discussing, I think it is a nicer solution, in terms of dependency management (as opposed to apt or homebrew), but the downside is no automated setup of a background service, using e.g. launchctl (osx), systemd (linux)

Out of interest, I have just posted here, to ask about such a feature https://groups.google.com/a/anaconda.com/g/anaconda/c/z36jZTlJG8g

@Zeleznyj

Zeleznyj commented Nov 1, 2022

I've just had the issue with the channel closed error while running RabbitMQ v3.9.13. I have increased the consumer_timeout as per the documentation, but the jobs crashed after about 5 hours. I have some even older jobs still running, so I'm not sure whether this is related to the timeout.

Going through the RabbitMQ documentation, I have noticed a possible mistake in the AiiDA documentation. It suggests:

# 100 hours in milliseconds (increase if you expect your workflows to run longer)
consumer_timeout = 3600000

however this appears to actually correspond to 1 hour, which is also what the RabbitMQ documentation says.

@sphuber
Contributor

sphuber commented Nov 1, 2022

Thanks for the report @Zeleznyj. Indeed, our wiki is incorrect and that is one hour, which would explain the error. Could you try to up it to, let's say, 3600000000 (1000 hours, just to be on the safe side) and restart the RabbitMQ service? Make sure to stop the daemon first and restart it when RabbitMQ is back up and running.

I will update the wiki now.
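
The unit arithmetic above can be checked with a short sketch. The helper names are hypothetical; the only assumption is that consumer_timeout is given in milliseconds, as stated in this thread.

```python
MS_PER_HOUR = 60 * 60 * 1000  # 3_600_000 milliseconds in one hour


def hours_to_ms(hours: int) -> int:
    """Convert a duration in hours to milliseconds."""
    return hours * MS_PER_HOUR


def ms_to_hours(milliseconds: int) -> float:
    """Convert a duration in milliseconds to hours."""
    return milliseconds / MS_PER_HOUR


def consumer_timeout_line(hours: int) -> str:
    """Render a consumer_timeout line in the format quoted in this thread."""
    return f"consumer_timeout = {hours_to_ms(hours)}"


print(ms_to_hours(3600000))        # value from the old wiki text -> 1.0
print(ms_to_hours(3600000000))     # value suggested above -> 1000.0
print(consumer_timeout_line(100))  # -> consumer_timeout = 360000000
```

So the 100 hours that the wiki comment intended would have been 360000000, not 3600000.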

@Zeleznyj

Zeleznyj commented Nov 1, 2022

I have tried increasing it. Let's see if that helps, but the error is clearly somewhat random.

I have encountered the error before and thought it was related to the fact that I'm running AiiDA on a laptop, but this time the computer was on the whole time the jobs were running.

@ahkole

ahkole commented Jan 27, 2023

Has anyone ever tried using the advanced.config to disable the timeout completely? The documentation (https://www.rabbitmq.com/consumers.html#acknowledgement-timeout) specifies that this should be possible by adding the following to a file named advanced.config:

%% advanced.config
[
  {rabbit, [
    {consumer_timeout, undefined}
  ]}
].

@rikigigi
Member

@ahkole I tried RabbitMQ 3.11.4 with the advanced config:

cat > ~/rabbitmq.notimeout.advanced.config <<EOF 
%% advanced.config
[
  {rabbit, [
    {consumer_timeout, undefined}
  ]}
].
EOF
export RABBITMQ_ADVANCED_CONFIG_FILE=~/rabbitmq.notimeout.advanced.config
rabbitmq-server

and everything worked as expected
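
For anyone who prefers to script this from Python, the same steps might be sketched as below. The function names are hypothetical; the file contents and the RABBITMQ_ADVANCED_CONFIG_FILE variable are taken from the shell snippet above, and the sketch assumes rabbitmq-server is on the PATH.

```python
import os
import subprocess
from pathlib import Path

# Same Erlang terms as in the advanced.config shown above.
ADVANCED_CONFIG = """%% advanced.config
[
  {rabbit, [
    {consumer_timeout, undefined}
  ]}
].
"""


def prepare_no_timeout_env(config_path: Path) -> dict:
    """Write the advanced config and return an environment pointing to it."""
    config_path.write_text(ADVANCED_CONFIG)
    env = dict(os.environ)
    env["RABBITMQ_ADVANCED_CONFIG_FILE"] = str(config_path)
    return env


def run_server(env: dict) -> subprocess.CompletedProcess:
    """Start rabbitmq-server in the foreground with the given environment."""
    return subprocess.run(["rabbitmq-server"], env=env, check=True)
```

Usage would be along the lines of `run_server(prepare_no_timeout_env(Path.home() / "rabbitmq.notimeout.advanced.config"))`.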

@khsrali
Contributor

khsrali commented Oct 15, 2024

Ok, right now verdi status returns correct instructions

✔ version: AiiDA v2.6.2.post0
✔ config: /tmp/pytest-of-khosra_a/pytest-10/fc80b65e071f67ef50d89cc715645faa0/.aiida
✔ profile: temp-profilecore.sqlite_dos
✔ storage: SqliteDosStorage[/tmp/pytest-of-khosra_a/pytest-10/test_sqlite_version_core_sqlit0]: open,
Warning: RabbitMQ v3.12.1 is not supported and will cause unexpected problems!
Warning: It can cause long-running workflows to crash and jobs to be submitted multiple times.
Warning: See https://github.com/aiidateam/aiida-core/wiki/RabbitMQ-version-to-use for details.
✔ broker: RabbitMQ v3.12.1 @ amqp://guest:[email protected]:5672?heartbeat=600
⏺ daemon: The daemon is not running.

I don't know in which PR this was solved, but I think we can close in here..

@khsrali khsrali closed this as completed Oct 15, 2024
@khsrali
Contributor

khsrali commented Oct 15, 2024

Alright, just found it. Adding here for the record:
#5317

@chrisjsewell
Member Author

> I don't know in which PR this was solved, but I think we can close in here..

well, it's up to you, but... I would say that is the solution to the "symptom", not the underlying problem (that rabbitmq is really not intended to be used this way) 😅

@agoscinski agoscinski added the type/wontfix apply only to closed issues label Oct 22, 2024
9 participants