Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix DLQ flake #814

Merged
merged 1 commit into from
Jun 14, 2022
Merged

Fix DLQ flake #814

merged 1 commit into from
Jun 14, 2022

Conversation

ikavgo
Copy link
Contributor

@ikavgo ikavgo commented Jun 13, 2022

  • Separates queue and policy reconciliation to prevent a race condition on RMQ server.

/kind bug

Fixes #792

@knative-prow
Copy link

knative-prow bot commented Jun 13, 2022

@ikvmw: The label(s) kind/<kind> cannot be applied, because the repository doesn't have them.

In response to this:

Changes

/kind

Fixes #

Release Note


Docs


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 13, 2022
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Jun 13, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 13, 2022
@knative-prow knative-prow bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 13, 2022
@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 13, 2022
@codecov
Copy link

codecov bot commented Jun 13, 2022

Codecov Report

Merging #814 (d6e0a74) into main (d7e48d6) will decrease coverage by 0.33%.
The diff coverage is 10.25%.

❗ Current head d6e0a74 differs from pull request most recent head 963417c. Consider uploading reports for the commit 963417c to get more accurate results

@@            Coverage Diff             @@
##             main     #814      +/-   ##
==========================================
- Coverage   74.30%   73.96%   -0.34%     
==========================================
  Files          39       39              
  Lines        2506     2520      +14     
==========================================
+ Hits         1862     1864       +2     
- Misses        577      588      +11     
- Partials       67       68       +1     
Impacted Files Coverage Δ
pkg/rabbit/service.go 11.36% <0.00%> (-0.18%) ⬇️
pkg/reconciler/trigger/trigger.go 62.34% <22.22%> (-1.72%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d7e48d6...963417c. Read the comment docs.

Copy link
Contributor

@gab-satchi gab-satchi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes lgtm. just a question about overriding the ERL_MAX_PORTS

containers:
- name: rabbitmq
env:
- name: ERL_MAX_PORTS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change still needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mentioned it today during retro - rabbitmq/cluster-operator#959

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we're not experiencing a crash loop with the node not starting though right? Our RMQ version is also newer (3.10) I believe so maybe that's why we're bypassing the issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't depend on RMQ version. it's a combination of host setup and memory limit. When Erlang starts it tries to allocate memory stuctures for all available FDs. For example:

cat /proc/sys/fs/file-max 
9223372036854775807

So the fact that it works locally and on GH is a pure luck

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these run in Kind and looking at the control-plane there, I get:

root@knative-control-plane:/# ulimit -n
1048576

Is 1048576 too high? That seems fine for 1GB of RAM. Also doesn't 4096 feel too low?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can just barely grok what's happening with Erlang here. I understand that with a too high FD limit, Erlang can cause RMQ broker to OOM as it tries to allocate something per FD.

But it does look like this change isn't directly tied to the DLQ flake we're seeing in conformance tests. Maybe we can omit this change for this PR to unblock it and then open something separate to discuss this? We also have RMQClusters defined elsewhere in docs and setup instructions so we'll likely need a broader change than just the test clusters.

Copy link
Contributor Author

@ikavgo ikavgo Jun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can just barely grok what's happening with Erlang here. I understand that with a too high FD limit, Erlang can cause RMQ broker to OOM as it tries to allocate something per FD.

you got it completely right. we have to limit the subset erlang sees because we don't control test environments. it's like restricting max file size while doing uploads handler, etc.

Having really high fd limit is not a problem of course. The problem is this particular software pattern of preallocating stuff.

It's of course possible for me to have this change in a separate PR. However, on my system tests can't be run without capping ERL_MAX_PRTS. So that's why it goes as a package.

- name: rabbitmq
env:
- name: ERL_MAX_PORTS
value: "4096"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this limit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mentioned it today during retro - rabbitmq/cluster-operator#959

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Jun 13, 2022
@knative-prow
Copy link

knative-prow bot commented Jun 13, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gabo1208, ikvmw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gabo1208
Copy link
Contributor

Just a comment but seems good to me

@knative-prow knative-prow bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 14, 2022
@ikavgo
Copy link
Contributor Author

ikavgo commented Jun 14, 2022

removed ports fix @gab-satchi. will open another pr

@gab-satchi gab-satchi changed the title WIP: Fix dlq flake Fix DLQ flake Jun 14, 2022
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 14, 2022
@gab-satchi
Copy link
Contributor

/lgtm

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Jun 14, 2022
@knative-prow knative-prow bot merged commit 68a9b30 into knative-extensions:main Jun 14, 2022
@ikavgo ikavgo mentioned this pull request Jul 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix DeliverySpec flake
4 participants