Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix interaction between interface timeout and RC POLL INTERVAL + Partial revert #2805 #2848

Merged
merged 9 commits into from
Aug 19, 2024

Conversation

cbeauchesne
Copy link
Collaborator

@cbeauchesne cbeauchesne commented Aug 5, 2024

Motivation

Being as close as possible from real conditions is a golden rules in functional testing.

The time gained by the change on DD_REMOTE_CONFIG_POLL_INTERVAL_SECONDS in #2805 wasn't computed.

Changes

Workflow

  1. ⚠️ Create your PR as draft ⚠️
  2. Work on you PR until the CI passes (if something not related to your task is failing, you can ignore it)
  3. Mark it as ready for review
    • Test logic is modified? -> Get a review from RFC owner. We're working on refining the codeowners file quickly.
    • Framework is modified, or non obvious usage of it -> get a review from R&P team

🚀 Once your PR is reviewed, you can merge it!

🛟 #apm-shared-testing 🛟

Reviewer checklist

  • If PR title starts with [<language>], double-check that only <language> is impacted by the change
  • No system-tests internal is modified. Otherwise, I have the approval from R&P team
  • CI is green, or failing jobs are not related to this change (and you are 100% sure about this statement)
  • A docker base image is modified?
    • the relevant build-XXX-image label is present
  • A scenario is added (or removed)?

@cbeauchesne
Copy link
Collaborator Author

Unfortunatly, partial revert is not possible, as DD_REMOTE_CONFIG_POLL_INTERVAL_SECONDS and library_interface_timeout seems to be interacting with each other, where they shouldn't.

We need to understand why the interact, and fix that. And if it's legit, choose values based on what why gain, to get the best compromise between optimization and being as close as possible from the real conditions.

@rochdev
Copy link
Member

rochdev commented Aug 5, 2024

We need to understand why the interact, and fix that.

This is out of scope for the values themselves. If there is a correlation, it already existed before.

The time gained by the change on DD_REMOTE_CONFIG_POLL_INTERVAL_SECONDS in #2805 wasn't computed.

We know what we gain as the timeout time is absolute, so it goes from 100 seconds to 10 seconds. This is significant time that is otherwise wasted for no reason. The gain is literally the timeout difference.

being as close as possible from the real conditions

This is real conditions, the requests are just made faster because we don't risk overloading the backend or the agent in tests.

I really don't see any actual compelling argument for reverting this. It will only slow down running the tests for theoretical academic reasons with no basis in the actual testing. The pragmatic approach is to make this fast if we can.

@cbeauchesne
Copy link
Collaborator Author

A flakyness oberved here https://github.com/DataDog/system-tests-dashboard/actions/runs/10260556706/job/28404990643#step:20:125

I don't remember seeing this issue previously. But can't affirm it's caused by #2805, but it's definitly on the short list.

@cbeauchesne
Copy link
Collaborator Author

Local tests on REMOTE_CONFIG_MOCKED_BACKEND_ASM_FEATURES / apache-mod-8.0 / dev shows the issue with or without #2805.

@cbeauchesne cbeauchesne changed the title Revert #2805 Fix interaction betwwen interface timeout and RC POLL INTERVAL + Partial revert #2805 Aug 7, 2024
@cbeauchesne cbeauchesne changed the title Fix interaction betwwen interface timeout and RC POLL INTERVAL + Partial revert #2805 Fix interaction between interface timeout and RC POLL INTERVAL + Partial revert #2805 Aug 7, 2024
@cbeauchesne cbeauchesne marked this pull request as ready for review August 7, 2024 14:04
@cbeauchesne cbeauchesne requested a review from a team as a code owner August 7, 2024 14:04
@cbeauchesne
Copy link
Collaborator Author

CI is ok (failures are not related)

@cbeauchesne
Copy link
Collaborator Author

The bug fixed in this PR caused a race condition that was very unlikely to happen with the previous values of DD_REMOTE_CONFIG_POLL_INTERVAL_SECONDS and library_interface_timeout. With values introduced in #2805 , the probability of flakiness in slightly bigger (how much?), without being a problem so far. I need to perform few more tests to have full visibility.

Though, if there is any doubt on issue related to remote config, I recommend to merge this PR before anything else, to remove a source of doubt.

@cbeauchesne
Copy link
Collaborator Author

Wait for #2894 for java failures

@cbeauchesne cbeauchesne merged commit 5da6c25 into main Aug 19, 2024
323 of 328 checks passed
@cbeauchesne cbeauchesne deleted the cbeauchesne/rv-2805 branch August 19, 2024 20:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants