Fix DLQ flake #814
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main     #814      +/-   ##
==========================================
- Coverage   74.30%   73.96%   -0.34%
==========================================
  Files          39       39
  Lines        2506     2520      +14
==========================================
+ Hits         1862     1864       +2
- Misses        577      588      +11
- Partials       67       68       +1
Continue to review full report at Codecov.
Changes LGTM. Just a question about overriding ERL_MAX_PORTS.
containers:
- name: rabbitmq
  env:
  - name: ERL_MAX_PORTS
Is this change still needed?
mentioned it today during retro - rabbitmq/cluster-operator#959
But we're not experiencing a crash loop with the node failing to start, right? Our RMQ version is also newer (3.10, I believe), so maybe that's why we're bypassing the issue?
It doesn't depend on the RMQ version; it's a combination of host setup and memory limit. When Erlang starts, it tries to allocate memory structures for all available FDs. For example:
cat /proc/sys/fs/file-max
9223372036854775807
So the fact that it works locally and on GH is pure luck.
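As a rough illustration of the mismatch (a sketch only; it assumes Erlang is installed wherever you run it, and the values shown are just the ones quoted above):

```sh
# The FD ceilings the Erlang VM can see at startup.
cat /proc/sys/fs/file-max   # system-wide max open files; can be astronomically high
ulimit -n                   # per-process limit inherited by beam.smp

# With ERL_MAX_PORTS set, the VM sizes its port table from that value instead
# of deriving it from the limits above.
ERL_MAX_PORTS=4096 erl -noshell -eval 'io:format("~p~n", [erlang:system_info(port_limit)]), halt().'
```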
All of these run in Kind and looking at the control-plane there, I get:
root@knative-control-plane:/# ulimit -n
1048576
Is 1048576 too high? That seems fine for 1GB of RAM. Also doesn't 4096 feel too low?
I can just barely grok what's happening with Erlang here. I understand that with too high an FD limit, Erlang can cause the RMQ broker to OOM as it tries to allocate something per FD.
But it does look like this change isn't directly tied to the DLQ flake we're seeing in conformance tests. Maybe we can omit this change for this PR to unblock it and then open something separate to discuss this? We also have RMQClusters defined elsewhere in docs and setup instructions so we'll likely need a broader change than just the test clusters.
I can just barely grok what's happening with Erlang here. I understand that with too high an FD limit, Erlang can cause the RMQ broker to OOM as it tries to allocate something per FD.
You got it completely right. We have to limit the subset Erlang sees because we don't control the test environments; it's like restricting the max file size in an upload handler, etc.
Having a really high FD limit is not a problem, of course. The problem is this particular software pattern of preallocating things up front.
It's of course possible for me to put this change in a separate PR. However, on my system the tests can't run without capping ERL_MAX_PORTS, so that's why it goes as a package.
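A sketch of how the cap could be verified on a running broker (the namespace and pod names below are hypothetical, not taken from this repo's setup):

```sh
# Ask the Erlang VM inside the broker for its effective port limit; with
# ERL_MAX_PORTS=4096 in the pod spec this should print 4096.
kubectl -n rabbitmq-system exec rabbitmq-server-0 -- \
  rabbitmqctl eval 'erlang:system_info(port_limit).'
```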
- name: rabbitmq
  env:
  - name: ERL_MAX_PORTS
    value: "4096"
why this limit?
mentioned it today during retro - rabbitmq/cluster-operator#959
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: gabo1208, ikvmw.
Just a comment but seems good to me
removed ports fix @gab-satchi. will open another PR
/lgtm
/kind bug
Fixes #792