test: passed test suite contains failed test cases #5123

chewong · 2021-07-20T18:11:12Z

/kind bug

What steps did you take and what happened:
[A clear and concise description of what the bug is.]

example: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-capz-windows-dockershim/1417250279195152384

The job above has two failed test cases but it's marked as passed:

Test started yesterday at 3:31 PM passed after 1h24m54s. (more info)

What did you expect to happen:

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

cluster-api-provider-azure version:
Kubernetes version: (use kubectl version):
OS (e.g. from /etc/os-release):

The text was updated successfully, but these errors were encountered:

jsturtevant · 2021-07-20T19:56:55Z

This appears to happen for Linux as well: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-capz-conformance/1416217627528794112

CecileRobertMichon · 2021-08-03T16:22:38Z

/priority important-soon
/help

k8s-ci-robot · 2021-08-03T16:22:39Z

@CecileRobertMichon:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/priority important-soon
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

CecileRobertMichon · 2021-08-18T23:37:33Z

Looks like this is an issue in CAPI test framework, CAPA has the same problem: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-provider-aws-e2e-conformance-with-k8s-ci-artifacts/1422914264003252224

CecileRobertMichon · 2021-08-19T16:15:44Z

Transferred to CAPI repo as this is an issue in the CAPI test framework and affects CAPI as well https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main-1-18-1-19.

I think ginkgo does not return a non-zero exit code and that’s why RunContainer does not fail (which is probably good because with our current code we would otherwise not gather the test results, and wouldn’t see them in Prow)
Imho, one solution could be to read the test results after:

cluster-api/test/framework/kubetest/run.go

Line 193 in e0ced24

return framework.GatherJUnitReports(reportDir, input.ArtifactsDirectory)

And if we find failed tests, return an error.

cc @sbueringer @randomvariable

CecileRobertMichon · 2021-08-19T16:15:59Z

/help

k8s-ci-robot · 2021-08-19T16:16:00Z

@CecileRobertMichon:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

randomvariable · 2021-08-19T17:15:11Z

Chat from Slack, looks like we're going to have to parse the JUnit XML and figure out if something failed, which is pretty terrible.

Philosophically in the initial implementation, the current behaviour was intended given that to get any sort of result from conformance was a big success in itself. Taking the next step in making a failed conformance test fail the suite is reasonable.

sbueringer · 2021-08-19T20:15:37Z

fyi. Prow is using GCP/testgrid to parse junit reports for the junit lens:
https://github.com/kubernetes/test-infra/blob/master/prow/spyglass/lenses/junit/lens.go#L162
https://github.com/GoogleCloudPlatform/testgrid/blob/master/metadata/junit/junit.go#L198-L204

No idea if it's a good idea to use it.

randomvariable · 2021-08-20T12:06:34Z

If it does what we want, i don't see why not.

sbueringer · 2021-08-20T12:36:42Z

If it does what we want, i don't see why not.

Just thinking about that we might don't want to have a dependency on testgrid and not sure how stable it is as a library (current verison is 0.0.91). But I guess as the worst case is to copy ~200 lines of code, it shouldn't hurt to have a dependency on it.

vincepri · 2021-09-30T14:42:41Z

We have the third_party folder where you could copy code from upstream repositories if needed.

/milestone v1.0
/priority important-soon

k8s-triage-robot · 2022-01-20T18:22:37Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

fabriziopandini · 2022-01-20T18:26:12Z

/lifecycle frozen

fabriziopandini · 2022-09-30T20:03:41Z

/triage accepted
worth to investigate if this is still the case

sbueringer · 2023-01-03T17:22:11Z

Definitely still the case.

This can be reproduced ~ like this:

Run an upgrade test locally with a breakpoint before we run kubetest
Break the cluster (e.g. change the image in the CoreDNS Deployment to an invalid name)
Wait 10m (or add system-pods-startup-timeout: 1m to conformance.yaml
Step through the remaining code. I would expect us to copy the junit reports around but based on the code we don't read and parse the results to fail the e2e test if there are any failures in the junit reports

cc @knabben (Just in case you're looking for more work in a bit :), I can help you getting the local e2e test up & running)

k8s-triage-robot · 2023-04-03T17:41:44Z

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

Confirm that this issue is still relevant with /triage accepted (org members only)
Deprioritize it with /priority important-longterm or /priority backlog
Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

fabriziopandini · 2023-04-14T16:40:26Z

/triage accepted

enxebre · 2023-06-30T12:39:59Z

@sbueringer Can we close this now that kubernetes-sigs/cluster-api-provider-azure#2265 is solved?

sbueringer · 2023-06-30T12:57:16Z

Not sure about CAPZ I think they had other problems, but I think our problem is not solved.

Should be still reproducible via: #5123 (comment)

We didn't make any changes in core CAPI

k8s-triage-robot · 2024-01-19T01:00:50Z

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

Confirm that this issue is still relevant with /triage accepted (org members only)
Deprioritize it with /priority important-longterm or /priority backlog
Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

fabriziopandini · 2024-04-22T13:15:04Z

/assign @chrischdi
To re-assess if we handle properly the results from conformance test

chrischdi · 2024-04-23T11:54:11Z

I tried to reproduce it again and @sbueringer already had the correct idea here:

Could it be that the ContainerInspect is looking into the newly restarted container instead of the "run" which just failed?

xref: fix in CAPI: #7946

source: kubernetes-sigs/cluster-api-provider-azure#2265

So before merging #7946 there was a race condition between reading the exit code and the restart triggered by docker, which cleared the exit code.

So if the ginkgo fails it should always return its non-zero exit code and because of that let the test fail.

One improvement which could maybe be done: still run the framework.GatherJUnitReports function to show the potential failed tests in prow instead of hiding away the junit report in the error case.
I will create a PR for this.

/close

Appendix:

I was able to reproduce the issue by using the instructions from here plus:

Removing the line at: https://github.com/kubernetes-sigs/cluster-api/blob/main/test/framework/kubetest/run.go#L204
Having a breakpoint at https://github.com/kubernetes-sigs/cluster-api/blob/main/test/infrastructure/container/docker.go#L536
- and let the debugger continue as soon as docker restarted the container

Note: this means that every usage of RunContainer which does not set RestartPolicy: dockercontainer.RestartPolicyDisabled may not able to rely on the exitcode check.

k8s-ci-robot · 2024-04-23T11:54:15Z

@chrischdi: Closing this issue.

In response to this:

I tried to reproduce it again and stefan already had the correct idea here:

Could it be that the ContainerInspect is looking into the newly restarted container instead of the "run" which just failed?

xref: fix in CAPI: #7946

source: kubernetes-sigs/cluster-api-provider-azure#2265

So before merging #7946 there was a race condition between reading the exit code and the restart triggered by docker, which cleared the exit code.

So if the ginkgo fails it should always return its non-zero exit code and because of that let the test fail.

One improvement which could maybe be done: still run the framework.GatherJUnitReports function to show the potential failed tests in prow instead of hiding away the junit report in the error case.
I will create a PR for this.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sbueringer · 2024-04-23T12:23:20Z

@chrischdi Just that I got it right. Current behavior:

If any conformance test fails, the test in Prow will fail as well
If any conformance test fails, we won't get the junit report in Prow (by which we would see what exactly failed)
Is that correct?

chrischdi · 2024-04-23T12:26:25Z

@chrischdi Just that I got it right. Current behavior:

If any conformance test fails, the test in Prow will fail as well

Exactly, the case described in this issue that the test "passed" was because of the race condition because docker restarted the container.

If any conformance test fails, we won't get the junit report in Prow (by which we would see what exactly failed)
Is that correct?

We would get them in prow, but hidden in artifacts as xml file / not visualised by prow. But this will be fixed via #10493

sbueringer · 2024-04-23T12:29:09Z

We would get them in prow, but hidden in artifacts as xml file / not visualised by prow. But this will be fixed via #10493

Not sure I got this part. I think we only have one xml file in artifacts, and that is the one that Prow uses to visualize: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-conformance-main/1780353942337622016/artifacts/

Are there more? (that are also uploaded to artifacts)

sbueringer · 2024-04-23T12:30:15Z

Ah, we move the files (in GatherJUnitReports)? So if the move doesn't happen they are still here? https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-conformance-main/1780353942337622016/artifacts/kubetest/k8s-conformance-zcv9fy/

chrischdi · 2024-04-23T12:32:25Z

Yeah, to make them visible to prow we have to move them up.

CecileRobertMichon transferred this issue from kubernetes-sigs/cluster-api-provider-azure Aug 19, 2021

k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 19, 2021

k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Aug 19, 2021

k8s-ci-robot added this to the v1.0 milestone Sep 30, 2021

k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Sep 30, 2021

vincepri modified the milestones: v1.0, v1.1 Oct 22, 2021

jsturtevant mentioned this issue Dec 17, 2021

Improve Windows logging and general stability for upstream e2e tests kubernetes-sigs/cluster-api-provider-azure#1932

Merged

3 tasks

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 20, 2022

k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 20, 2022

fabriziopandini modified the milestones: v1.1, v1.2 Feb 3, 2022

fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022

fabriziopandini removed this from the v1.2 milestone Jul 29, 2022

fabriziopandini removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022

k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Sep 30, 2022

sbueringer mentioned this issue Feb 2, 2023

conformance_test.go exit code to 0 event the non last test suite has failures kubernetes-sigs/cluster-api-provider-azure#2265

Closed

k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Apr 3, 2023

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 14, 2023

k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 19, 2024

fabriziopandini removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Apr 16, 2024

k8s-ci-robot assigned chrischdi Apr 22, 2024

k8s-ci-robot closed this as completed Apr 23, 2024

chrischdi mentioned this issue Apr 23, 2024

🐛 kubetest: also gather junit reports in case of errors observed from ginkgo #10493

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: passed test suite contains failed test cases #5123

test: passed test suite contains failed test cases #5123

chewong commented Jul 20, 2021

jsturtevant commented Jul 20, 2021

CecileRobertMichon commented Aug 3, 2021

k8s-ci-robot commented Aug 3, 2021

CecileRobertMichon commented Aug 18, 2021

CecileRobertMichon commented Aug 19, 2021

CecileRobertMichon commented Aug 19, 2021

k8s-ci-robot commented Aug 19, 2021

randomvariable commented Aug 19, 2021

sbueringer commented Aug 19, 2021

randomvariable commented Aug 20, 2021

sbueringer commented Aug 20, 2021

vincepri commented Sep 30, 2021

k8s-triage-robot commented Jan 20, 2022

fabriziopandini commented Jan 20, 2022

fabriziopandini commented Sep 30, 2022

sbueringer commented Jan 3, 2023 •

edited

Loading

k8s-triage-robot commented Apr 3, 2023

fabriziopandini commented Apr 14, 2023

enxebre commented Jun 30, 2023

sbueringer commented Jun 30, 2023

k8s-triage-robot commented Jan 19, 2024

fabriziopandini commented Apr 22, 2024

chrischdi commented Apr 23, 2024 •

edited

Loading

k8s-ci-robot commented Apr 23, 2024

sbueringer commented Apr 23, 2024

chrischdi commented Apr 23, 2024

sbueringer commented Apr 23, 2024

sbueringer commented Apr 23, 2024 •

edited

Loading

chrischdi commented Apr 23, 2024

test: passed test suite contains failed test cases #5123

test: passed test suite contains failed test cases #5123

Comments

chewong commented Jul 20, 2021

jsturtevant commented Jul 20, 2021

CecileRobertMichon commented Aug 3, 2021

k8s-ci-robot commented Aug 3, 2021

CecileRobertMichon commented Aug 18, 2021

CecileRobertMichon commented Aug 19, 2021

CecileRobertMichon commented Aug 19, 2021

k8s-ci-robot commented Aug 19, 2021

randomvariable commented Aug 19, 2021

sbueringer commented Aug 19, 2021

randomvariable commented Aug 20, 2021

sbueringer commented Aug 20, 2021

vincepri commented Sep 30, 2021

k8s-triage-robot commented Jan 20, 2022

fabriziopandini commented Jan 20, 2022

fabriziopandini commented Sep 30, 2022

sbueringer commented Jan 3, 2023 • edited Loading

k8s-triage-robot commented Apr 3, 2023

fabriziopandini commented Apr 14, 2023

enxebre commented Jun 30, 2023

sbueringer commented Jun 30, 2023

k8s-triage-robot commented Jan 19, 2024

fabriziopandini commented Apr 22, 2024

chrischdi commented Apr 23, 2024 • edited Loading

k8s-ci-robot commented Apr 23, 2024

sbueringer commented Apr 23, 2024

chrischdi commented Apr 23, 2024

sbueringer commented Apr 23, 2024

sbueringer commented Apr 23, 2024 • edited Loading

chrischdi commented Apr 23, 2024

sbueringer commented Jan 3, 2023 •

edited

Loading

chrischdi commented Apr 23, 2024 •

edited

Loading

sbueringer commented Apr 23, 2024 •

edited

Loading