
Failing test: X-Pack Alerting API Integration Tests.x-pack/test/alerting_api_integration/observability/custom_threshold_rule/rate_bytes_fired·ts - Observability Rules Rules Endpoints Custom Threshold rule RATE - GROUP_BY - BYTES - FIRED Rule creation should be active #176401

Closed
kibanamachine opened this issue Feb 7, 2024 · 11 comments
Assignees: maryam-saeidi
Labels: failed-test (A test failure on a tracked branch, potentially flaky-test), Team:obs-ux-management (Observability Management User Experience Team)

@kibanamachine (Contributor)

A test failed on a tracked branch

Error: Timeout of 360000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/var/lib/buildkite-agent/builds/kb-n2-4-spot-8fac66d24e5b8d43/elastic/kibana-on-merge/kibana/x-pack/test/alerting_api_integration/observability/custom_threshold_rule/rate_bytes_fired.ts)
    at listOnTimeout (node:internal/timers:573:17)
    at processTimers (node:internal/timers:514:7) {
  code: 'ERR_MOCHA_TIMEOUT',
  timeout: 360000,
  file: '/var/lib/buildkite-agent/builds/kb-n2-4-spot-8fac66d24e5b8d43/elastic/kibana-on-merge/kibana/x-pack/test/alerting_api_integration/observability/custom_threshold_rule/rate_bytes_fired.ts'
}

First failure: CI Build - main

@kibanamachine kibanamachine added the failed-test A test failure on a tracked branch, potentially flaky-test label Feb 7, 2024
@botelastic botelastic bot added the needs-team Issues missing a team label label Feb 7, 2024
@kibanamachine kibanamachine added the Team:obs-ux-management Observability Management User Experience Team label Feb 7, 2024
@elasticmachine (Contributor)

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Feb 7, 2024
@maryam-saeidi maryam-saeidi self-assigned this Mar 5, 2024
@maryam-saeidi (Member)

Here is the history of this failure; it seems to me to be a one-time issue. I am checking in Slack to see whether anyone has another opinion about it.

@maryam-saeidi (Member) commented Mar 6, 2024

Here are my findings from the investigation of this test failure: (Slack 1, Slack 2)

  1. This issue happened only within a window of a couple of hours: once on main and three times on other branches. (History of this failure)

    [screenshot: history of this failure across branches]

  2. In all of the above cases, the failing step is the check of whether the rule is active.

  3. Usually, this check takes only a couple of seconds and looks something like this:

    But in this failure, the step takes more than 5 minutes without logging a single retry:

    • This can be related to the exponential backoff described here, combined with the attempts not being logged until either the timeout or a successful try (see the sketch after this list).

      The example below will retry a potentially failing dns.resolve operation 10 times using an exponential backoff strategy. With the default settings, this means the last attempt is made after 17 minutes and 3 seconds.


  4. Here are some ideas from @dmlemeshko:
    The test should not get "hung" for 6+ minutes and fail on timeout. Possible reasons:

    • a retry (or multiple retries) with an overall timeout exceeding the 6-minute suite (mocha) timeout
    • API calls using supertest without a timeout
    • unhandled Promise(s)
      Also, the describe block has a 6-minute timeout.
  5. Based on what has been discussed so far, it seems the timeout is not set properly for our test.
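
To illustrate item 3 above, here is a minimal sketch (not the actual test code) of how p-retry's exponential backoff can quietly consume the whole mocha timeout when the failed attempts are not logged; the retry count and the check function are assumptions for the example:

```ts
import pRetry from 'p-retry';

// Hypothetical check standing in for the "is the rule active?" polling in the test.
async function checkRuleIsActive(): Promise<void> {
  throw new Error('rule is not active yet');
}

async function waitForActiveRule() {
  return pRetry(checkRuleIsActive, {
    // p-retry defaults to 10 retries with exponential backoff (factor 2, minTimeout 1s),
    // so the last attempts are scheduled many minutes after the first one.
    retries: 10,
    // Without this callback nothing is printed between attempts, which matches the
    // "takes more than 5 minutes without logging any retry" symptom described above.
    onFailedAttempt: (error) => {
      console.log(
        `Attempt ${error.attemptNumber} failed (${error.retriesLeft} retries left): ${error.message}`
      );
    },
  });
}
```

With the default backoff, the waits between attempts keep roughly doubling, which is consistent with a single step appearing to hang for several minutes with no output.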

I've started checking what the current timeout value is and how we can adjust it for our use case. Here are the related findings:
@wayneseymour helped with finding out what the timeout for this test is; apparently it is 10 minutes.

The config for that test points to a function, that function points to x-pack/test/api_integration/config.ts, and that file points to x-pack/test/functional/config.base.js.
This JS file uses values from test/common/config.js and test/functional/config.base.js.
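
As a hedged sketch (the file below is illustrative, not the real config), a config in this chain could make the mocha timeout explicit by overriding mochaOpts from the inherited base config:

```ts
// Sketch of an FTR config override; the path and the timeout value are illustrative.
import { FtrConfigProviderContext } from '@kbn/test';

export default async function ({ readConfigFile }: FtrConfigProviderContext) {
  // The inherited chain eventually resolves mochaOpts.timeout from the base configs above.
  const baseConfig = await readConfigFile(require.resolve('../../api_integration/config.ts'));

  return {
    ...baseConfig.getAll(),
    mochaOpts: {
      ...baseConfig.get('mochaOpts'),
      // Make the per-test timeout explicit instead of relying on the inherited default.
      timeout: 360_000,
    },
  };
}
```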

When running the test server with the -v option:

node scripts/functional_tests_server.js --config x-pack/test/alerting_api_integration/observability/config.ts -v

Command for running the test:

node scripts/functional_test_runner --config=x-pack/test/alerting_api_integration/observability/config.ts --include=x-pack/test/alerting_api_integration/observability/custom_threshold_rule/rate_bytes_fired.ts -v

I see the following timeout values (not sure which one applies in our case):

 proc [kbn-ui] [2024-03-06T12:26:34.693+01:00][INFO ][plugins.securitySolution.endpoint:user-artifact-packager:1.0.0] Registering endpoint:user-artifact-packager task with timeout of [20m], interval of [60s] and policy update batch size of [25] {"service":{"node":{"roles":["ui"]}}}
 proc [kbn-tasks] [2024-03-06T12:26:32.868+01:00][INFO ][plugins.securitySolution.endpoint:user-artifact-packager:1.0.0] Registering endpoint:user-artifact-packager task with timeout of [20m], interval of [60s] and policy update batch size of [25] {"service":{"node":{"roles":["background_tasks"]}}}
 proc [kbn-tasks] [2024-03-06T12:26:38.061+01:00][INFO ][plugins.fleet.fleet:check-deleted-files-task:1.0.1] Started with interval of [1d] and timeout of [2m] {"service":{"node":{"roles":["background_tasks"]}}}
 proc [kbn-ui] [2024-03-06T12:26:37.506+01:00][INFO ][plugins.fleet.fleet:check-deleted-files-task:1.0.1] Started with interval of [1d] and timeout of [2m] {"service":{"node":{"roles":["ui"]}}}

Update: apparently, -v does not work for API integration tests; the misunderstanding was that this is a functional test.

Since we are using supertest for the API call (await supertest.get(`/api/alerting/rule/${id}`)), I also checked how we provide this service, and it seems it gets the config from the config service:

import { format as formatUrl } from 'url';
import supertest from 'supertest';
// FtrProviderContext comes from the suite's ftr_provider_context helper.

export function KibanaSupertestProvider({ getService }: FtrProviderContext) {
  const config = getService('config');
  // The Kibana URL (and therefore the connection settings) comes from the resolved FTR config.
  const kibanaServerUrl = formatUrl(config.get('servers.kibana'));
  return supertest(kibanaServerUrl);
}
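
One of the possibilities mentioned above is API calls using supertest without a timeout. As a hedged sketch (not a change that was made in the test), a per-request timeout could be set through superagent's .timeout(), so a stalled call fails fast instead of silently eating the mocha timeout:

```ts
// Sketch only: the rule id and the timeout values are illustrative.
const response = await supertest
  .get(`/api/alerting/rule/${id}`)
  // superagent-level timeouts: response = time to first byte, deadline = whole request.
  .timeout({ response: 10_000, deadline: 30_000 })
  .expect(200);
```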

I will investigate more about the timeouts to see how they can be adjusted for our case.

Action item

@benakansara (Contributor)

@maryam-saeidi I don't think the issue is related to the timeout, but rather to data ingestion.

By looking at the logs of the successful test and the failed test, I can see the following difference:

  • In case of failure, the rule status remained "ok" because the rule generated a "no data" alert.
  • In case of success, the rule status became "active" because the rule generated an alert on data.

Success
[00:00:47] │ proc [kbn-tasks] [2024-02-07T13:21:22.010+00:00][INFO ][plugins.eventLog] event logged: {"@timestamp":"2024-02-07T13:21:22.009Z","event":{"provider":"alerting","action":"new-instance","kind":"alert","category":["observability"],"start":"2024-02-07T13:21:21.656Z","duration":"0"},"kibana":{"alert":{"flapping":false,"uuid":"6dc89a90-852b-44e5-9d31-3e1686aa6f2b","rule":{"rule_type_id":"observability.rules.custom_threshold","consumer":"logs","execution":{"uuid":"6ca62843-94ae-4410-ab73-8f2c706227aa"}}},"alerting":{"instance_id":"host-0,container-0","action_group_id":"custom_threshold.fired"},"saved_objects":[{"rel":"primary","type":"alert","id":"d34e8761-0ec2-4067-bb43-18640ebe5f89","type_id":"observability.rules.custom_threshold"}],"space_ids":["default"],"server_uuid":"00000000-2785-441b-ae8c-186a1936b17d","version":"8.13.0"},"message":"observability.rules.custom_threshold:d34e8761-0ec2-4067-bb43-18640ebe5f89: 'Threshold rule' created new alert: 'host-0,container-0'","rule":{"id":"d34e8761-0ec2-4067-bb43-18640ebe5f89","license":"basic","category":"observability.rules.custom_threshold","ruleset":"observability","name":"Threshold rule"},"ecs":{"version":"1.8.0"}} {"service":{"node":{"roles":["background_tasks"]}}}

Failure
[00:05:48] │ proc [kbn-tasks] [2024-02-07T13:11:55.322+00:00][INFO ][plugins.eventLog] event logged: {"@timestamp":"2024-02-07T13:11:55.321Z","event":{"provider":"alerting","action":"new-instance","kind":"alert","category":["observability"],"start":"2024-02-07T13:11:55.082Z","duration":"0"},"kibana":{"alert":{"flapping":false,"uuid":"5b7df5f1-d307-4396-9ba0-a6384b7338a5","rule":{"rule_type_id":"observability.rules.custom_threshold","consumer":"logs","execution":{"uuid":"ad4a508e-8cd6-48a9-8cf4-8d767e5bd9bf"}}},"alerting":{"instance_id":"host-0,container-0","action_group_id":"custom_threshold.nodata"},"saved_objects":[{"rel":"primary","type":"alert","id":"9a5b2624-0895-4dd0-9a71-a26aabd21e50","type_id":"observability.rules.custom_threshold"}],"space_ids":["default"],"server_uuid":"00000000-2785-441b-ae8c-186a1936b17d","version":"8.13.0"},"message":"observability.rules.custom_threshold:9a5b2624-0895-4dd0-9a71-a26aabd21e50: 'Threshold rule' created new alert: 'host-0,container-0'","rule":{"id":"9a5b2624-0895-4dd0-9a71-a26aabd21e50","license":"basic","category":"observability.rules.custom_threshold","ruleset":"observability","name":"Threshold rule"},"ecs":{"version":"1.8.0"}} {"service":{"node":{"roles":["background_tasks"]}}}

IMO the test failure is related to:

  • test data not being available at the time the rule ran; something might be interfering with data ingestion, but I am not sure what
  • in case of a no-data alert, the rule status is "ok", which seems wrong

I was able to reproduce the issue a couple of times in my local environment. It could be because either my Mac is slow or I had another instance of ES and Kibana running (not sure how that would affect data ingestion, though).
@maryam-saeidi could you check if you can reproduce the test failure by keeping ES and Kibana running locally and then starting the test servers?

@maryam-saeidi (Member)

I don't think the issue is related to the timeout, but rather to data ingestion.

@benakansara I didn't say the issue is the timeout itself, but rather that in case of a timeout we wait a long time before reporting back; we could have failed the test sooner and maybe logged more information to understand why that happened.

test data not being available at the time the rule ran; something might be interfering with data ingestion, but I am not sure what

That would be my general guess as well, but the question is why that happened, and why only for this test.

in case of a no-data alert, the rule status is "ok", which seems wrong

Do you have a log for that scenario? If you check my earlier comment, the issue is not that we receive an "ok" status, but that we do not receive any status at all as a result of the timeout.

could you check if you can reproduce the test failure by keeping ES and Kibana running locally and then starting the test servers?

I will check it.

@benakansara (Contributor) commented Mar 6, 2024

We could definitely fail the test sooner. 👍 My concern was that even if we make the timeout shorter, the test will still fail if the same scenario arises that caused it to fail in the first place, which seems to be related to data. If we can find out why some errors are not logged, that would indeed be helpful too.

If you check my earlier comment, the issue is not that we receive an "ok" status, but that we do not receive any status at all as a result of the timeout.

Could you point me to the comment/log you are referring to? Is it the screenshot? The test failure shows a timeout, but the test is actually failing because it does not receive the "active" status after all the retries.

Do you have a log for that scenario?

I checked the rule status with console.log when this failure happened.

@benakansara (Contributor)

By the way, once this failure starts, in all subsequent runs I see the "should set correct action variables" test failing, as was also reported in the issue.

@maryam-saeidi (Member) commented Mar 12, 2024

We could definitely fail the test sooner. 👍 My concern was that even if we make the timeout shorter, the test will still fail if the same scenario arises that caused it to fail in the first place, which seems to be related to data. If we can find out why some errors are not logged, that would indeed be helpful too.

Again, I think we are saying the same thing: the timeout and logging improvements are not the solution for the current issue, but a way to have better tests and, hopefully, more information to debug the issue the next time it happens.

Even if we find a way to reproduce the error locally by making ES inaccessible, it does not mean we can do much about it, as we rely on having a stable connection to ES and that part is not in our control. Also, we cannot be sure that this is the same issue as the current failing test; it might look similar, but that does not guarantee we found the cause. The strange part is that this happened only once, which makes it hard to track down. Hopefully, better logging helps in this case.
Still, if there is a way to reproduce it consistently, let me know; or if you have an idea about how to fix it and can create a PR for it, even better.

Could you point me to the comment/log you are referring to? Is it the screenshot?

Yes, step 3 in my comment. I am creating a PR to replace pRetry with a function from the retryService to give us more logs, so we can tell whether it is a single try that takes that long or multiple retries whose logs we are simply missing (a rough sketch of the idea follows below).
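
For illustration, a minimal sketch of what swapping pRetry for the FTR retry service could look like in a test like this one; the timeout value, the ruleId variable, and the exact assertion are assumptions for the example, not the code from the PR:

```ts
// Assumes the usual FTR services are available in the test scope.
const retry = getService('retry');
const supertest = getService('supertest');

const executionStatus = await retry.tryForTime(2 * 60 * 1000, async () => {
  const { body } = await supertest
    .get(`/api/alerting/rule/${ruleId}`) // ruleId is assumed to be created earlier in the test
    .expect(200);

  // Unlike the previous pRetry setup, the retry service logs every failed attempt,
  // so a long wait shows up in the CI output instead of looking like a hang.
  if (body.execution_status?.status !== 'active') {
    throw new Error(`Rule status is "${body.execution_status?.status}", expected "active"`);
  }
  return body.execution_status;
});
```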

By the way, once this failure starts, in all subsequent runs I see the "should set correct action variables" test failing, as was also reported in the issue.

I checked this issue; the tests are running with the bail option, and what you noticed comes from two different test runs. I've adjusted the data to fix the second issue here.

Regarding the "ok" status, as we discussed in Slack, it is due to not having an active alert for that rule, so we need to check why the alert was not generated as expected: is the issue during data generation, rule execution, or something else?

@benakansara (Contributor)

I checked this issue; the tests are running with the bail option, and what you noticed comes from two different test runs. I've adjusted the data to fix the second issue here.

I think the CI runs the tests twice. In the first run, the current test ("should be active") failed, and in the second run, "should set correct action variables" failed. I wanted to share that I could see the same thing happening when running the tests locally.


Even if we find a way to reproduce the error locally by making ES inaccessible, it does not mean we can do much about it, as we rely on having a stable connection to ES and that part is not in our control.

I am not sure I follow what you mean by making ES inaccessible. I think we are not cleaning up the data properly after the tests run, which leaves the indices in a state where some tests can fail. I am not sure why it happened only once. In order to fix the test, we need to be able to reproduce the behavior somehow. I noticed that if I kept a local ES instance running along with the test servers, the test failed. I tried cleaning up the indices before the tests, but the test still failed sometimes (a sketch of that cleanup idea follows below). I think the data improvements you are doing in this PR will bring some consistency to how the tests execute. Thanks for improving this and the logging. 👍
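
As a hedged sketch of that cleanup idea (both the index pattern and the availability of the esDeleteAllIndices FTR service in this suite are assumptions):

```ts
// Sketch: clean the generated data indices before the suite runs, so documents left
// over from a previous run (or a locally running cluster) cannot skew the rule result.
const esDeleteAllIndices = getService('esDeleteAllIndices');

before(async () => {
  // 'kbn-data-forge*' is an assumed pattern for the generated test data indices.
  await esDeleteAllIndices('kbn-data-forge*');
});
```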

I agree with having more meaningful logs; they will help us investigate and find the root cause in the future if the test fails.

@mgiota (Contributor) commented Mar 12, 2024

I am going to follow what you describe here, because we have a similar failing test: #173653

maryam-saeidi added a commit that referenced this issue Mar 19, 2024
…hod from retryService (#178515)

Related to #176401, #175776

## Summary

This PR:

- Improves logging (I've added debug logs to the helpers that do an API request, such as creating a data view)
- Uses the retryService instead of pRetry
- When an error is thrown in pRetry with 10 retries, it does not log the retry attempts, and we end up in the situation mentioned in this [comment, item 3](#176401 (comment))
    
|Before|After|
|---|---|
|![image](https://github.com/elastic/kibana/assets/12370520/576146f2-09da-4221-a570-6d47e047f229)|![image](https://github.com/elastic/kibana/assets/12370520/0a0897a3-0bd3-4d44-9b79-8f99fb580b4a)|
- Attempts to fix flakiness in the rate reason message caused by having different data

![image](https://github.com/elastic/kibana/assets/12370520/dff48ac1-a9bf-4b93-addb-fd40acae382e)


### Flaky test runner
#### Current (after adding refresh index and adjusting timeout)
- [25] https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5463 ✅
- [200] https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5465 ✅

#### Old
- [25] https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5452 ✅
- [200] https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5454 [1 Failed : 25 Canceled : 174 Passed]

##### After checking data is generated in metric threshold
- [25] https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5460 ✅
- [200] https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5462 [1 Failed : 199 Canceled]

Inspired by #173998, special
thanks to @jpdjere and @dmlemeshko for their support and knowledge
sharing.
@maryam-saeidi (Member)

I made some improvements in this PR and will close this issue, as I didn't see a failure when running the test 200 times after the improvement.
