Failing test: X-Pack Alerting API Integration Tests.x-pack/test/alerting_api_integration/observability/custom_threshold_rule/rate_bytes_fired.ts - Observability Rules Rules Endpoints Custom Threshold rule RATE - GROUP_BY - BYTES - FIRED Rule creation should be active #176401
Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)
Here is the history of this failure. It seems to me to be a one-time issue; I am checking in Slack to see if someone has another opinion about it.
Here are my findings from the investigation of this test failure: (Slack 1, Slack 2)
I've started checking what the current timeout value is and how we can adjust it for our use case; here are the related findings:
When running the test server with
Command for running the test:
I see the following timeout values (not sure which one is applicable in our case):
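For reference, FTR configs expose a `timeouts` section that test services read; the retry service, for example, takes its default window from `timeouts.try`. A minimal sketch of overriding it in a test config, with a hypothetical base config path and an illustrative value:

```ts
import { FtrConfigProviderContext } from '@kbn/test';

// A hedged sketch of an FTR config override; './base.config' and the 120s
// value are illustrative. `timeouts.try` is the overall window the FTR
// retry service uses by default.
export default async function ({ readConfigFile }: FtrConfigProviderContext) {
  const baseConfig = await readConfigFile(require.resolve('./base.config'));

  return {
    ...baseConfig.getAll(),
    timeouts: {
      ...baseConfig.get('timeouts'),
      try: 120 * 1000,
    },
  };
}
```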
Update: Since we are using supertest for the API call (await supertest.get(
I will investigate the timeouts further to see how they can be adjusted for our case (one option is sketched below). Action item
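Since supertest wraps superagent, a per-call timeout can also be set on the request itself. A minimal sketch, not the actual test setup; the URL, port, and values are illustrative:

```ts
import supertest from 'supertest';

// Illustrative Kibana test-server URL; 5620 is the usual FTR port.
const agent = supertest('http://localhost:5620');

async function getWithTimeout(path: string) {
  return await agent
    .get(path)
    // superagent timeouts: `response` caps the wait for the first byte and
    // `deadline` caps the whole request, so a hung call fails fast instead
    // of stalling the suite until the mocha-level timeout fires.
    .timeout({ response: 10_000, deadline: 30_000 })
    .expect(200);
}
```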
@maryam-saeidi I don't think the issue is related to the timeout, but rather to data ingestion. Looking at the logs of a successful test and a failed test, I can see the difference below.
Success | Failure (log screenshots compared side by side)
IMO the test failure is related to -
I was able to reproduce the issue a couple of times in my local env. It could be because my Mac is slow, or because I had another instance of ES and Kibana running (not sure how that would affect data ingestion, though).
@benakansara I didn't say the issue is the timeout itself, but rather that in the case of a timeout we wait a long time to report back; we could have failed the test sooner and maybe logged more information to understand why it happened.
That would be my general guess as well, but the question is why it happened, and why only for this test.
Do you have a log for that scenario? If you check my earlier comment, the issue is not that we receive a non-OK status, but that we don't receive any status at all as a result of the timeout.
I will check it.
We could definitely fail the test sooner. 👍 My concern was that even if we make the timeout shorter, the test will still fail whenever the scenario that caused it to fail in the first place arises again, which seems to be related to data. If we can find out why some errors are not logged, that would indeed be helpful too.
Could you point me to the comment/log you are referring to? Is it the screenshot? The test failure shows a timeout, but it is actually failing because the "active" status was not received after all the retries.
I checked the rule status with
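The exact command was not captured above; one way to inspect a rule's execution status is Kibana's alerting API. A minimal sketch, assuming the FTR supertest agent and an illustrative `ruleId`:

```ts
// Fetch the rule and log its execution status; `supertest` and `ruleId` are
// stand-ins for the test's own context.
const { body: rule } = await supertest
  .get(`/api/alerting/rule/${ruleId}`)
  .expect(200);

// execution_status.status is e.g. 'ok', 'active', 'pending', or 'error';
// when it is 'error', execution_status.error carries the reason.
console.log(rule.execution_status.status, rule.execution_status.error);
```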
Btw, once this failure starts, in all subsequent runs I see
Again, I think we are saying the same thing: the timeout and logging improvements are not the solution to the current issue, but a way to have better tests and, hopefully, more info to debug the issue the next time it happens. Even if we find a way to reproduce the error locally by making ES inaccessible, it does not mean we can do much about it, as we rely on having a stable connection to ES, and that part is not in our control. Also, we cannot be sure this is the same issue as the current failing test; it might look similar, but that does not guarantee we have found the cause. The strange part is that this happened only once, which makes it hard to track down. Hopefully, better logging helps in this case.
Yes, step 3 in my comment. I am creating a PR to replace pRetry with a function from retryService, to give us more logs about whether it is one try that takes that long or multiple retries where we are only missing the logs (a rough sketch follows).
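A rough sketch of that swap; `getService`, `fetchRuleStatus`, and `ruleId` are hypothetical stand-ins for the suite's own context and helpers:

```ts
import pRetry from 'p-retry';

// Runs inside an FTR test body; `getService`, `fetchRuleStatus`, and `ruleId`
// are stand-ins for the suite's own context and helpers.

// Before: pRetry retries silently unless `onFailedAttempt` is wired up, so the
// CI log cannot tell one long attempt apart from ten quick failed ones.
await pRetry(() => fetchRuleStatus(ruleId), { retries: 10 });

// After: the FTR retry service logs each failed attempt while it keeps
// retrying until the overall window elapses.
const retry = getService('retry');
await retry.tryForTime(120_000, async () => {
  const status = await fetchRuleStatus(ruleId);
  if (status !== 'active') {
    throw new Error(`rule status is "${status}", expected "active"`);
  }
});
```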
I checked this issue: the tests are running with the bail option, and what you noticed is from two different test runs. I've adjusted the data to fix the second issue here. Regarding the OK status, as we discussed in Slack, it is due to not having an active alert for that rule, so we need to check why the alert was not generated as expected: is the issue in data generation, rule execution, or ...
I think the CI runs the tests twice. In the first run, the current test ("should be active") failed, and in the second run, "should set correct action variables" failed. I wanted to share that I could see the same thing happening when running the tests locally.
I am not sure I follow what you mean by making ES inaccessible. I think we are not cleaning up the data properly after the tests run, and that leaves the indices in a state where some tests can fail. I am not sure why it happened only once. In order to fix the test, we need to be able to reproduce the behavior somehow. I noticed that if I kept a local ES instance running alongside the test servers, the test failed. I tried cleaning up the indices before the tests, but the test still failed sometimes (a hedged cleanup sketch follows). I think the data improvements you are making in this PR will bring some consistency to how the tests execute. Thanks for improving this and the logging. 👍 I agree with having more and meaningful logs; it will help us investigate and find the root cause if the test fails in the future.
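For the cleanup attempt mentioned above, a hedged sketch using Kibana's FTR index-cleanup helper (`getService` stands for the suite's context; the index pattern is illustrative):

```ts
// Delete indices left over from a previous run so each test starts from a
// clean slate; the wildcard pattern is illustrative.
const esDeleteAllIndices = getService('esDeleteAllIndices');
await esDeleteAllIndices('kbn-data-forge-fake_hosts*');
```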
I am going to follow what you described here, because we have a similar failing test: #173653
…hod from retryService (#178515)

Related to #176401, #175776

## Summary

This PR:

- Improves logging (I've added debug logs to the helpers that do an API request, such as creating a data view)
- Uses retryService instead of pRetry: when pRetry throws an error after its 10 retries, it does not log the retry attempts, and we end up in the situation mentioned in this [comment, item 3](#176401 (comment))

|Before|After|
|---|---|
|![image](https://github.com/elastic/kibana/assets/12370520/576146f2-09da-4221-a570-6d47e047f229)|![image](https://github.com/elastic/kibana/assets/12370520/0a0897a3-0bd3-4d44-9b79-8f99fb580b4a)|

- Attempts to fix flakiness in the rate reason message caused by having different data

![image](https://github.com/elastic/kibana/assets/12370520/dff48ac1-a9bf-4b93-addb-fd40acae382e)

### Flaky test runner

#### Current (after adding a refresh of the index and adjusting the timeout)

- [25] https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5463 ✅
- [200] https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5465 ✅

#### Old

- [25] https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5452 ✅
- [200] https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5454 [1 failed : 25 canceled : 174 passed]

##### After checking that data is generated in metric threshold

- [25] https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5460 ✅
- [200] https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5462 [1 failed : 199 canceled]

Inspired by #173998; special thanks to @jpdjere and @dmlemeshko for their support and knowledge sharing.
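The "refresh index" part of that fix boils down to forcing newly ingested documents to become searchable before the rule runs. A minimal sketch, assuming the FTR Elasticsearch client service (`es`) and an illustrative index pattern:

```ts
// Make just-ingested documents visible to searches immediately, so the first
// rule execution does not race the ingestion; the pattern is illustrative.
await es.indices.refresh({ index: 'kbn-data-forge-fake_hosts*' });
```

Without the refresh, documents only become searchable after the index's refresh interval elapses, which can leave the first rule evaluation looking at an empty window.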
I made some improvements in this PR and will close this issue, as I didn't see a failure when I ran the test 200 times after the improvement.
A test failed on a tracked branch
First failure: CI Build - main