Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unskip enrichments tests #171983

Merged
merged 13 commits into from
Jan 12, 2024
Merged

Conversation

nkhristinin
Copy link
Contributor

@nkhristinin nkhristinin commented Nov 27, 2023

Fix enrichments cypress tests

Related PR with removing ILM as it not supported in serverless. Here the error in the CI related to that

@nkhristinin
Copy link
Contributor Author

@elasticmachine merge upstream

@nkhristinin
Copy link
Contributor Author

/ci

@nkhristinin nkhristinin marked this pull request as ready for review December 22, 2023 10:26
@nkhristinin nkhristinin requested review from a team as code owners December 22, 2023 10:26
@nkhristinin nkhristinin added the release_note:skip Skip the PR/issue when compiling release notes label Dec 22, 2023
@nkhristinin
Copy link
Contributor Author

@elasticmachine merge upstream

@nkhristinin
Copy link
Contributor Author

@nkhristinin
Copy link
Contributor Author

flaky tests 100 ess + 100 serverless https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4673

Copy link
Contributor

@rylnd rylnd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love an explanation for the problem/solution here; without that, I don't see how we can have confidence that this will fix the flakiness.

Also, on my investigation branch I was observing the failures at about a 1/75 rate, so I think a flakey run of more than 100 would be appropriate here to eliminate the potential for a false positive (or would it be a false negative? Either way, you get my meaning).

@@ -37,7 +37,7 @@ const CURRENT_HOST_RISK_LEVEL = 'Current host risk level';
const ORIGINAL_HOST_RISK_LEVEL = 'Original host risk level';

// FLAKY: https://github.com/elastic/kibana/issues/169154
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// FLAKY: https://github.com/elastic/kibana/issues/169154

@@ -31,10 +31,6 @@
},
"settings": {
"index": {
"lifecycle": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please provide an explanation for the issue that caused the test failures, and why removing these ILM settings fixes them? Other developers (myself included; this one was tricky) would benefit greatly from your findings here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding, is that ILM is not supported in serverless, which mean they not supported in cypress tests, I put some related PR/error in description

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another idea, that maybe it was partially fixed by this PR, when EA tests were separated to a different execution

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nkhristinin it looks like link you provided to the error has expired. Can you please post something more permanent here for posterity? That way if this test fails again, we can validate whether the hypothesized ILM error is to blame.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RE the ILM hypothesis: in my investigation I saw failures in ESS as well, so I think that invalidates ILM as the underlying issue.

I also don't think ILM explains the behavior I was observing: the test fails due to the presence of unexpected data (specifically, historical risk scores).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, what was with the link, but I changed it to the link of failing build in this pr.

The serverless tests are failing because of ILM, I am not sure if it's related to your initial investigation, or if it's a new problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had not encountered that ILM error during my investigation, and since it's specific to serverless it cannot be the root cause of #169154.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's not the root cause of those issues. After more deep look into that, this is my understanding of the timeline:

We did have some flaky tests, which you investigate in this PR

After some time, EA tests were moved to separate executions

When I started to work on those tests, there were constant test failures in serverless, not flaky tests, which this PR if fixed.

Then I ran several flaky test runs, which were successful for enrichments, but had failures not related to this test case:

x100 - errors related to dashboard

x200 - errors related to dashboard

And then I saw that there was one more x200 test run, which had 1 error (test 175) for enrichment tests, but this error also not related to original issues, but something that happened in before all hook

Original error:

AssertionError: Timed out retrying after 150000ms: Expected not to find content: 'Original host risk level' within the element: <div.euiFlexGroup.StyledEuiFlexGroup-sc-1az7ye2-2.chTqHU.css-1a15qtm-euiFlexGroup-responsive-none-flexStart-center-row> but continuously found it.
at Context.eval (webpack:///./e2e/entity_analytics/enrichments.cy.ts:129:69)

New error:

CypressError: cy.task('esArchiverUnload') timed out after waiting 60000ms.
https://on.cypress.io/api/task
Because this error occurred during a before all hook we are skipping the remaining tests in the current suite: Enrichment

Conclusion:
We have not yet seen that original issue, but there were other flaky tests, so I will continue to run the flaky test runner, and we should have more details.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the last test run confirmed that it still flaky -

@nkhristinin
Copy link
Contributor Author

@elasticmachine merge upstream

@nkhristinin
Copy link
Contributor Author

200 ESS + 200 serverless flaky test runner
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4679

@nkhristinin
Copy link
Contributor Author

@elasticmachine merge upstream

@nkhristinin
Copy link
Contributor Author

@elasticmachine merge upstream

@nkhristinin
Copy link
Contributor Author

@elasticmachine merge upstream

@nkhristinin
Copy link
Contributor Author

@elasticmachine merge upstream

@nkhristinin
Copy link
Contributor Author

Copy link
Contributor

@rylnd rylnd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the changes included in this branch, and the parallel changes discussed here, these tests look to be significantly less flaky (i.e. 1 failure in 1000). I'm happy to merge this, get our test coverage back, and iterate in the future if these do fail in the wild again.

Thanks for taking the time to dig deep here @nkhristinin !

@nkhristinin
Copy link
Contributor Author

@nkhristinin
Copy link
Contributor Author

@elasticmachine merge upstream

Copy link
Member

@MadameSheema MadameSheema left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-engineering-productivity changes LGTM

@kibana-ci
Copy link
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] Jest Tests #1 / AssigneesFilterPopover calls onSelectionChange when 1 user is selected
  • [job] [logs] Jest Tests #1 / AssigneesFilterPopover calls onSelectionChange with a single user when different users are selected

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@nkhristinin nkhristinin merged commit d43e7e5 into elastic:main Jan 12, 2024
38 checks passed
@kibanamachine kibanamachine added v8.13.0 backport:skip This commit does not require backporting labels Jan 12, 2024
semd pushed a commit to semd/kibana that referenced this pull request Jan 12, 2024
## Fix enrichments cypress tests

[Related PR with removing
ILM](elastic#167916) as it not supported
in serverless. Here the
[error](https://buildkite.com/elastic/kibana-pull-request/builds/179121)
in the CI related to that

---------

Co-authored-by: Kibana Machine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting release_note:skip Skip the PR/issue when compiling release notes v8.13.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants