
Failing test: X-Pack Alerting API Integration Tests.x-pack/test/alerting_api_integration/observability/custom_threshold_rule/rate_bytes_fired·ts - Observability Rules Rules Endpoints Custom Threshold rule RATE - GROUP_BY - BYTES - FIRED Rule creation should be active #176401

Closed
kibanamachine opened this issue Feb 7, 2024 · 11 comments
Assignees: maryam-saeidi
Labels: failed-test (A test failure on a tracked branch, potentially flaky-test), Team:obs-ux-management (Observability Management User Experience Team)

@kibanamachine (Contributor)

A test failed on a tracked branch

Error: Timeout of 360000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/var/lib/buildkite-agent/builds/kb-n2-4-spot-8fac66d24e5b8d43/elastic/kibana-on-merge/kibana/x-pack/test/alerting_api_integration/observability/custom_threshold_rule/rate_bytes_fired.ts)
    at listOnTimeout (node:internal/timers:573:17)
    at processTimers (node:internal/timers:514:7) {
  code: 'ERR_MOCHA_TIMEOUT',
  timeout: 360000,
  file: '/var/lib/buildkite-agent/builds/kb-n2-4-spot-8fac66d24e5b8d43/elastic/kibana-on-merge/kibana/x-pack/test/alerting_api_integration/observability/custom_threshold_rule/rate_bytes_fired.ts'
}

First failure: CI Build - main

@kibanamachine kibanamachine added the failed-test A test failure on a tracked branch, potentially flaky-test label Feb 7, 2024
@botelastic botelastic bot added the needs-team Issues missing a team label label Feb 7, 2024
@kibanamachine kibanamachine added the Team:obs-ux-management Observability Management User Experience Team label Feb 7, 2024
@elasticmachine (Contributor)

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Feb 7, 2024
@maryam-saeidi maryam-saeidi self-assigned this Mar 5, 2024
@maryam-saeidi (Member)

Here is the history of this failure; it seems to me to be a one-time issue. I am checking in Slack to see whether anyone has another opinion about it.

@maryam-saeidi (Member) commented Mar 6, 2024

Here are my findings from the investigation of this test failure: (Slack 1, Slack 2)

  1. This issue happened only within a window of a couple of hours: once on main and three times on other branches. (History of this failure)

    [screenshot: history of this failure across branches]

  2. In all of the above cases, the failing step is the check of whether the rule is active.

  3. Usually, this check takes only a couple of seconds and looks something like this:

    But in this failure, the step takes more than 5 minutes without logging a single retry:

    • This can be related to the exponential backoff described here, combined with the attempts not being logged until either the timeout or a successful try (see the sketch after this list).

      The example below will retry a potentially failing dns.resolve operation 10 times using an exponential backoff strategy. With the default settings, this means the last attempt is made after 17 minutes and 3 seconds.


  4. Here are some ideas from @dmlemeshko:
    The test should not get "hung" for 6+ minutes and fail on timeout. Possible reasons:

    • a retry (or multiple retries) with an overall timeout exceeding the 6-minute suite (mocha) timeout
    • API calls using supertest without a timeout
    • unhandled Promise(s)
      Also, the describe block has a 6-minute timeout.
  5. Based on what has been discussed so far, it seems the timeout is not set properly for our test.
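
To illustrate item 3 above, here is a minimal sketch (not the actual test code) of how p-retry's exponential backoff can quietly consume the whole mocha timeout when the failed attempts are not logged; the retry count and the check function are assumptions for the example:

```ts
import pRetry from 'p-retry';

// Hypothetical check standing in for the "is the rule active?" polling in the test.
async function checkRuleIsActive(): Promise<void> {
  throw new Error('rule is not active yet');
}

async function waitForActiveRule() {
  return pRetry(checkRuleIsActive, {
    // p-retry defaults to 10 retries with exponential backoff (factor 2, minTimeout 1s),
    // so the last attempts are scheduled many minutes after the first one.
    retries: 10,
    // Without this callback nothing is printed between attempts, which matches the
    // "takes more than 5 minutes without logging any retry" symptom described above.
    onFailedAttempt: (error) => {
      console.log(
        `Attempt ${error.attemptNumber} failed (${error.retriesLeft} retries left): ${error.message}`
      );
    },
  });
}
```

With the default backoff, the waits between attempts keep roughly doubling, which is consistent with a single step appearing to hang for several minutes with no output.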

I've started checking what the current timeout value is and how we can adjust it for our use case. Here are the related findings:
@wayneseymour helped with finding out what the timeout for this test is; apparently it is 10 minutes.

The config for that test points to a function, that function points to x-pack/test/api_integration/config.ts, and that file points to x-pack/test/functional/config.base.js.
This JS file uses values from test/common/config.js and test/functional/config.base.js.
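
As a hedged sketch (the file below is illustrative, not the real config), a config in this chain could make the mocha timeout explicit by overriding mochaOpts from the inherited base config:

```ts
// Sketch of an FTR config override; the path and the timeout value are illustrative.
import { FtrConfigProviderContext } from '@kbn/test';

export default async function ({ readConfigFile }: FtrConfigProviderContext) {
  // The inherited chain eventually resolves mochaOpts.timeout from the base configs above.
  const baseConfig = await readConfigFile(require.resolve('../../api_integration/config.ts'));

  return {
    ...baseConfig.getAll(),
    mochaOpts: {
      ...baseConfig.get('mochaOpts'),
      // Make the per-test timeout explicit instead of relying on the inherited default.
      timeout: 360_000,
    },
  };
}
```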

When running the test server with the -v option:

node scripts/functional_tests_server.js --config x-pack/test/alerting_api_integration/observability/config.ts -v

Command for running the test:

node scripts/functional_test_runner --config=x-pack/test/alerting_api_integration/observability/config.ts --include=x-pack/test/alerting_api_integration/observability/custom_threshold_rule/rate_bytes_fired.ts -v

I see the following timeout values (not sure which one applies in our case):

 proc [kbn-ui] [2024-03-06T12:26:34.693+01:00][INFO ][plugins.securitySolution.endpoint:user-artifact-packager:1.0.0] Registering endpoint:user-artifact-packager task with timeout of [20m], interval of [60s] and policy update batch size of [25] {"service":{"node":{"roles":["ui"]}}}
 proc [kbn-tasks] [2024-03-06T12:26:32.868+01:00][INFO ][plugins.securitySolution.endpoint:user-artifact-packager:1.0.0] Registering endpoint:user-artifact-packager task with timeout of [20m], interval of [60s] and policy update batch size of [25] {"service":{"node":{"roles":["background_tasks"]}}}
 proc [kbn-tasks] [2024-03-06T12:26:38.061+01:00][INFO ][plugins.fleet.fleet:check-deleted-files-task:1.0.1] Started with interval of [1d] and timeout of [2m] {"service":{"node":{"roles":["background_tasks"]}}}
 proc [kbn-ui] [2024-03-06T12:26:37.506+01:00][INFO ][plugins.fleet.fleet:check-deleted-files-task:1.0.1] Started with interval of [1d] and timeout of [2m] {"service":{"node":{"roles":["ui"]}}}

Update: apparently, -v does not work for API integration tests; the misunderstanding was that this is a functional test.

Since we are using supertest for the API call (await supertest.get(`/api/alerting/rule/${id}`)), I also checked how we provide this service, and it seems it gets the config from the config service:

import { format as formatUrl } from 'url';
import supertest from 'supertest';
// FtrProviderContext comes from the suite's ftr_provider_context helper.

export function KibanaSupertestProvider({ getService }: FtrProviderContext) {
  const config = getService('config');
  // The Kibana URL (and therefore the connection settings) comes from the resolved FTR config.
  const kibanaServerUrl = formatUrl(config.get('servers.kibana'));
  return supertest(kibanaServerUrl);
}
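
One of the possibilities mentioned above is API calls using supertest without a timeout. As a hedged sketch (not a change that was made in the test), a per-request timeout could be set through superagent's .timeout(), so a stalled call fails fast instead of silently eating the mocha timeout:

```ts
// Sketch only: the rule id and the timeout values are illustrative.
const response = await supertest
  .get(`/api/alerting/rule/${id}`)
  // superagent-level timeouts: response = time to first byte, deadline = whole request.
  .timeout({ response: 10_000, deadline: 30_000 })
  .expect(200);
```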

I will investigate more about the timeouts to see how they can be adjusted for our case.

Action item

@benakansara (Contributor)

@maryam-saeidi I don't think the issue is related to the timeout, but rather to data ingestion.

By looking at the logs of the successful test and the failed test, I can see the following difference:

  • In case of failure, the rule status remained "ok" because the rule generated a "no data" alert.
  • In case of success, the rule status became "active" because the rule generated an alert on data.

Success
[00:00:47] │ proc [kbn-tasks] [2024-02-07T13:21:22.010+00:00][INFO ][plugins.eventLog] event logged: {"@timestamp":"2024-02-07T13:21:22.009Z","event":{"provider":"alerting","action":"new-instance","kind":"alert","category":["observability"],"start":"2024-02-07T13:21:21.656Z","duration":"0"},"kibana":{"alert":{"flapping":false,"uuid":"6dc89a90-852b-44e5-9d31-3e1686aa6f2b","rule":{"rule_type_id":"observability.rules.custom_threshold","consumer":"logs","execution":{"uuid":"6ca62843-94ae-4410-ab73-8f2c706227aa"}}},"alerting":{"instance_id":"host-0,container-0","action_group_id":"custom_threshold.fired"},"saved_objects":[{"rel":"primary","type":"alert","id":"d34e8761-0ec2-4067-bb43-18640ebe5f89","type_id":"observability.rules.custom_threshold"}],"space_ids":["default"],"server_uuid":"00000000-2785-441b-ae8c-186a1936b17d","version":"8.13.0"},"message":"observability.rules.custom_threshold:d34e8761-0ec2-4067-bb43-18640ebe5f89: 'Threshold rule' created new alert: 'host-0,container-0'","rule":{"id":"d34e8761-0ec2-4067-bb43-18640ebe5f89","license":"basic","category":"observability.rules.custom_threshold","ruleset":"observability","name":"Threshold rule"},"ecs":{"version":"1.8.0"}} {"service":{"node":{"roles":["background_tasks"]}}}

Failure
[00:05:48] │ proc [kbn-tasks] [2024-02-07T13:11:55.322+00:00][INFO ][plugins.eventLog] event logged: {"@timestamp":"2024-02-07T13:11:55.321Z","event":{"provider":"alerting","action":"new-instance","kind":"alert","category":["observability"],"start":"2024-02-07T13:11:55.082Z","duration":"0"},"kibana":{"alert":{"flapping":false,"uuid":"5b7df5f1-d307-4396-9ba0-a6384b7338a5","rule":{"rule_type_id":"observability.rules.custom_threshold","consumer":"logs","execution":{"uuid":"ad4a508e-8cd6-48a9-8cf4-8d767e5bd9bf"}}},"alerting":{"instance_id":"host-0,container-0","action_group_id":"custom_threshold.nodata"},"saved_objects":[{"rel":"primary","type":"alert","id":"9a5b2624-0895-4dd0-9a71-a26aabd21e50","type_id":"observability.rules.custom_threshold"}],"space_ids":["default"],"server_uuid":"00000000-2785-441b-ae8c-186a1936b17d","version":"8.13.0"},"message":"observability.rules.custom_threshold:9a5b2624-0895-4dd0-9a71-a26aabd21e50: 'Threshold rule' created new alert: 'host-0,container-0'","rule":{"id":"9a5b2624-0895-4dd0-9a71-a26aabd21e50","license":"basic","category":"observability.rules.custom_threshold","ruleset":"observability","name":"Threshold rule"},"ecs":{"version":"1.8.0"}} {"service":{"node":{"roles":["background_tasks"]}}}

IMO the test failure is related to:

  • test data not being available at the time the rule ran; something might be interfering with data ingestion, but I am not sure what
  • in case of a no-data alert, the rule status is "ok", which seems wrong

I was able to reproduce the issue a couple of times in my local environment. It could be because either my Mac is slow or I had another instance of ES and Kibana running (not sure how that would affect data ingestion, though).
@maryam-saeidi could you check if you can reproduce the test failure by keeping ES and Kibana running locally and then starting the test servers?

@maryam-saeidi (Member)

I don't think the issue is related to the timeout, but rather to data ingestion.

@benakansara I didn't say the issue is the timeout itself, but rather that in case of a timeout we wait a long time before reporting back; we could have failed the test sooner and maybe logged more information to understand why that happened.

test data not being available at the time the rule ran; something might be interfering with data ingestion, but I am not sure what

That would be my general guess as well, but the question is why that happened, and why only for this test.

in case of a no-data alert, the rule status is "ok", which seems wrong

Do you have a log for that scenario? If you check my earlier comment, the issue is not that we receive an "ok" status, but that we do not receive any status at all as a result of the timeout.

could you check if you can reproduce the test failure by keeping ES and Kibana running locally and then starting the test servers?

I will check it.

@benakansara (Contributor) commented Mar 6, 2024

We could definitely fail the test sooner. 👍 My concern was that even if we make the timeout shorter, the test will still fail if the same scenario arises that caused it to fail in the first place, which seems to be related to data. If we can find out why some errors are not logged, that would indeed be helpful too.

If you check my earlier comment, the issue is not that we receive an "ok" status, but that we do not receive any status at all as a result of the timeout.

Could you point me to the comment/log you are referring to? Is it the screenshot? The test failure shows a timeout, but the test is actually failing because it does not receive the "active" status after all the retries.

Do you have a log for that scenario?

I checked the rule status with console.log when this failure happened.

@benakansara (Contributor)

By the way, once this failure starts, in all subsequent runs I see the "should set correct action variables" test failing, as was also reported in the issue.

@maryam-saeidi (Member) commented Mar 12, 2024

We could definitely fail the test sooner. 👍 My concern was that even if we make the timeout shorter, the test will still fail if the same scenario arises that caused it to fail in the first place, which seems to be related to data. If we can find out why some errors are not logged, that would indeed be helpful too.

Again, I think we are saying the same thing: the timeout and logging improvements are not the solution for the current issue, but a way to have better tests and, hopefully, more information to debug the issue the next time it happens.

Even if we find a way to reproduce the error locally by making ES inaccessible, it does not mean we can do much about it, as we rely on having a stable connection to ES and that part is not in our control. Also, we cannot be sure that this is the same issue as the current failing test; it might look similar, but that does not guarantee we found the cause. The strange part is that this happened only once, which makes it hard to track down. Hopefully, better logging helps in this case.
Still, if there is a way to reproduce it consistently, let me know; or if you have an idea about how to fix it and can create a PR for it, even better.

Could you point me to the comment/log you are referring to? Is it the screenshot?

Yes, step 3 in my comment. I am creating a PR to replace pRetry with a function from the retryService to give us more logs, so we can tell whether it is a single try that takes that long or multiple retries whose logs we are simply missing (a rough sketch of the idea follows below).
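
For illustration, a minimal sketch of what swapping pRetry for the FTR retry service could look like in a test like this one; the timeout value, the ruleId variable, and the exact assertion are assumptions for the example, not the code from the PR:

```ts
// Assumes the usual FTR services are available in the test scope.
const retry = getService('retry');
const supertest = getService('supertest');

const executionStatus = await retry.tryForTime(2 * 60 * 1000, async () => {
  const { body } = await supertest
    .get(`/api/alerting/rule/${ruleId}`) // ruleId is assumed to be created earlier in the test
    .expect(200);

  // Unlike the previous pRetry setup, the retry service logs every failed attempt,
  // so a long wait shows up in the CI output instead of looking like a hang.
  if (body.execution_status?.status !== 'active') {
    throw new Error(`Rule status is "${body.execution_status?.status}", expected "active"`);
  }
  return body.execution_status;
});
```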

By the way, once this failure starts, in all subsequent runs I see the "should set correct action variables" test failing, as was also reported in the issue.

I checked this issue; the tests are running with the bail option, and what you noticed comes from two different test runs. I've adjusted the data to fix the second issue here.

Regarding the "ok" status, as we discussed in Slack, it is due to not having an active alert for that rule, so we need to check why the alert was not generated as expected: is the issue during data generation, rule execution, or something else?

@benakansara (Contributor)

I checked this issue; the tests are running with the bail option, and what you noticed comes from two different test runs. I've adjusted the data to fix the second issue here.

I think the CI runs the tests twice. In the first run, the current test ("should be active") failed, and in the second run, "should set correct action variables" failed. I wanted to share that I could see the same thing happening when running the tests locally.


Even if we find a way to reproduce the error locally by making ES inaccessible, it does not mean we can do much about it, as we rely on having a stable connection to ES and that part is not in our control.

I am not sure I follow what you mean by making ES inaccessible. I think we are not cleaning up the data properly after the tests run, which leaves the indices in a state where some tests can fail. I am not sure why it happened only once. In order to fix the test, we need to be able to reproduce the behavior somehow. I noticed that if I kept a local ES instance running along with the test servers, the test failed. I tried cleaning up the indices before the tests, but the test still failed sometimes (a sketch of that cleanup idea follows below). I think the data improvements you are doing in this PR will bring some consistency to how the tests execute. Thanks for improving this and the logging. 👍
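
As a hedged sketch of that cleanup idea (both the index pattern and the availability of the esDeleteAllIndices FTR service in this suite are assumptions):

```ts
// Sketch: clean the generated data indices before the suite runs, so documents left
// over from a previous run (or a locally running cluster) cannot skew the rule result.
const esDeleteAllIndices = getService('esDeleteAllIndices');

before(async () => {
  // 'kbn-data-forge*' is an assumed pattern for the generated test data indices.
  await esDeleteAllIndices('kbn-data-forge*');
});
```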

I agree with having more meaningful logs; they will help us investigate and find the root cause in the future if the test fails.

@mgiota (Contributor) commented Mar 12, 2024

I am going to follow what you describe here, because we have a similar failing test: #173653

maryam-saeidi added a commit that referenced this issue Mar 19, 2024
…hod from retryService (#178515)

Related to #176401, #175776

## Summary

This PR:

- Improves logging (I've added debug logs to the helpers that do an API request, such as creating a data view)
- Uses the retryService instead of pRetry
- When an error is thrown in pRetry with 10 retries, it does not log the retry attempts, and we end up in the situation mentioned in this [comment, item 3](#176401 (comment))
    
|Before|After|
|---|---|
|![image](https://github.com/elastic/kibana/assets/12370520/576146f2-09da-4221-a570-6d47e047f229)|![image](https://github.com/elastic/kibana/assets/12370520/0a0897a3-0bd3-4d44-9b79-8f99fb580b4a)|
- Attempts to fix flakiness in the rate reason message caused by having different data

![image](https://github.com/elastic/kibana/assets/12370520/dff48ac1-a9bf-4b93-addb-fd40acae382e)


### Flaky test runner
#### Current (after adding refresh index and adjusting timeout)
- [25] https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5463 ✅
- [200] https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5465 ✅

#### Old
- [25] https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5452 ✅
- [200] https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5454 [1 Failed : 25 Canceled : 174 Passed]

##### After checking data is generated in metric threshold
- [25] https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5460 ✅
- [200] https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5462 [1 Failed : 199 Canceled]

Inspired by #173998, special
thanks to @jpdjere and @dmlemeshko for their support and knowledge
sharing.
@maryam-saeidi (Member)

I made some improvements in this PR and will close this issue, as I didn't see a failure when running the test 200 times after the improvement.
