-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unskip enrichments tests #171983
Unskip enrichments tests #171983
Conversation
@elasticmachine merge upstream |
/ci |
@elasticmachine merge upstream |
flaky tests 100 ess + 100 serverless https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4673 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would love an explanation for the problem/solution here; without that, I don't see how we can have confidence that this will fix the flakiness.
Also, on my investigation branch I was observing the failures at about a 1/75 rate, so I think a flakey run of more than 100 would be appropriate here to eliminate the potential for a false positive (or would it be a false negative? Either way, you get my meaning).
@@ -37,7 +37,7 @@ const CURRENT_HOST_RISK_LEVEL = 'Current host risk level'; | |||
const ORIGINAL_HOST_RISK_LEVEL = 'Original host risk level'; | |||
|
|||
// FLAKY: https://github.com/elastic/kibana/issues/169154 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// FLAKY: https://github.com/elastic/kibana/issues/169154 |
@@ -31,10 +31,6 @@ | |||
}, | |||
"settings": { | |||
"index": { | |||
"lifecycle": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please provide an explanation for the issue that caused the test failures, and why removing these ILM settings fixes them? Other developers (myself included; this one was tricky) would benefit greatly from your findings here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding, is that ILM is not supported in serverless, which mean they not supported in cypress tests, I put some related PR/error in description
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another idea, that maybe it was partially fixed by this PR, when EA tests were separated to a different execution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nkhristinin it looks like link you provided to the error has expired. Can you please post something more permanent here for posterity? That way if this test fails again, we can validate whether the hypothesized ILM error is to blame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RE the ILM hypothesis: in my investigation I saw failures in ESS as well, so I think that invalidates ILM as the underlying issue.
I also don't think ILM explains the behavior I was observing: the test fails due to the presence of unexpected data (specifically, historical risk scores).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, what was with the link, but I changed it to the link of failing build in this pr.
The serverless tests are failing because of ILM, I am not sure if it's related to your initial investigation, or if it's a new problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had not encountered that ILM error during my investigation, and since it's specific to serverless it cannot be the root cause of #169154.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's not the root cause of those issues. After more deep look into that, this is my understanding of the timeline:
We did have some flaky tests, which you investigate in this PR
After some time, EA tests were moved to separate executions
When I started to work on those tests, there were constant test failures in serverless, not flaky tests, which this PR if fixed.
Then I ran several flaky test runs, which were successful for enrichments, but had failures not related to this test case:
x100 - errors related to dashboard
x200 - errors related to dashboard
And then I saw that there was one more x200 test run, which had 1 error (test 175) for enrichment tests, but this error also not related to original issues, but something that happened in before all
hook
Original error:
AssertionError: Timed out retrying after 150000ms: Expected not to find content: 'Original host risk level' within the element: <div.euiFlexGroup.StyledEuiFlexGroup-sc-1az7ye2-2.chTqHU.css-1a15qtm-euiFlexGroup-responsive-none-flexStart-center-row> but continuously found it.
at Context.eval (webpack:///./e2e/entity_analytics/enrichments.cy.ts:129:69)
New error:
CypressError:
cy.task('esArchiverUnload')
timed out after waiting60000ms
.
https://on.cypress.io/api/task
Because this error occurred during abefore all
hook we are skipping the remaining tests in the current suite:Enrichment
Conclusion:
We have not yet seen that original issue, but there were other flaky tests, so I will continue to run the flaky test runner, and we should have more details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the last test run confirmed that it still flaky -
@elasticmachine merge upstream |
200 ESS + 200 serverless flaky test runner |
@elasticmachine merge upstream |
@elasticmachine merge upstream |
@elasticmachine merge upstream |
@elasticmachine merge upstream |
250+247 flaky tests - https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4782 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the changes included in this branch, and the parallel changes discussed here, these tests look to be significantly less flaky (i.e. 1 failure in 1000). I'm happy to merge this, get our test coverage back, and iterate in the future if these do fail in the wild again.
Thanks for taking the time to dig deep here @nkhristinin !
One more flaky test run |
@elasticmachine merge upstream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
security-engineering-productivity changes LGTM
💛 Build succeeded, but was flaky
Failed CI StepsTest Failures
Metrics [docs]
History
To update your PR or re-run it, just comment with: |
## Fix enrichments cypress tests [Related PR with removing ILM](elastic#167916) as it not supported in serverless. Here the [error](https://buildkite.com/elastic/kibana-pull-request/builds/179121) in the CI related to that --------- Co-authored-by: Kibana Machine <[email protected]>
Fix enrichments cypress tests
Related PR with removing ILM as it not supported in serverless. Here the error in the CI related to that