
gate: high failure ratio because of Workload cluster (without lb) creation #927

Closed
jichenjc opened this issue Jul 1, 2021 · 13 comments · Fixed by #930, #938 or #939
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@jichenjc
Contributor

jichenjc commented Jul 1, 2021

/kind bug

What steps did you take and what happened:
[A clear and concise description of what the bug is.]

Several PRs are hitting the same issue, and retrying 1-2 times still produces the same failure.

[1] 
[1] • Failure [1891.637 seconds]
[1] e2e tests
[1] /home/prow/go/src/sigs.k8s.io/cluster-api-provider-openstack/test/e2e/suites/e2e/e2e_test.go:49
[1]   Workload cluster (without lb)
[1]   /home/prow/go/src/sigs.k8s.io/cluster-api-provider-openstack/test/e2e/suites/e2e/e2e_test.go:133
[1]     It should be creatable and deletable [It]
[1]     /home/prow/go/src/sigs.k8s.io/cluster-api-provider-openstack/test/e2e/suites/e2e/e2e_test.go:175
[1] 
[1]     Timed out after 1800.001s.
[1]     Expected
[1]         <bool>: false
[1]     to be true
[1] 

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/924/pull-cluster-api-provider-openstack-e2e-test/1410422201777131520/build-log.txt

We need to check what is happening to cause such a high failure ratio.

What did you expect to happen:

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

  • Cluster API Provider OpenStack version (Or git rev-parse HEAD if manually built):
  • Cluster-API version:
  • OpenStack version:
  • Minikube/KIND version:
  • Kubernetes version (use kubectl version):
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jul 1, 2021
@jichenjc
Contributor Author

jichenjc commented Jul 9, 2021

Per
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/924/pull-cluster-api-provider-openstack-e2e-test/1412692551504236544/artifacts/clusters/bootstrap/controllers/capo-controller-manager/capo-controller-manager-6f4cdd5947-42qv4/manager.log

it's because we didn't associate the floating IP correctly:

I0707 09:26:57.481439 1 openstackmachine_controller.go:258] controller-runtime/manager/controller/openstackmachine "msg"="Error state detected, skipping reconciliation"

My assumption is that this is a timing issue (after creating a floating IP, do we need to wait some time before it can be associated?). We create the floating IP and then try to associate it immediately after creation, which might lead to a race, so it would be better to retry the association after waiting a short time following creation.
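
For illustration only, here is a minimal sketch of what retrying the association with a short wait could look like, using wait.PollImmediate from apimachinery. The associate callback and the poll/timeout values are assumptions made for this sketch, not the actual CAPO code.

package main

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// associateFloatingIPWithRetry retries a (hypothetical) association callback
// instead of failing on the first attempt right after the floating IP is created.
func associateFloatingIPWithRetry(associate func() error) error {
	// Poll every 5s for up to 2 minutes; both values are placeholders.
	return wait.PollImmediate(5*time.Second, 2*time.Minute, func() (bool, error) {
		if err := associate(); err != nil {
			// Treat the error as transient and retry; a real implementation would
			// distinguish retryable errors (e.g. the floating IP not yet visible
			// to the compute service) from permanent ones.
			return false, nil
		}
		return true, nil
	})
}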

@tobiasgiese
Member

@jichenjc what PR do you mean? If you mean #925, then I don't think that it's related to this issue. It was already present prior to the merge of the PR.

@jichenjc
Contributor Author

jichenjc commented Jul 9, 2021

@jichenjc what PR do you mean? If you mean #925, then I don't think that it's related to this issue. It was already present prior to the merge of the PR.

Not that one, I only gave that log as an example...

@jichenjc
Contributor Author

jichenjc commented Jul 13, 2021

Looks like I made a wrong assumption based on the previous info.

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/924/pull-cluster-api-provider-openstack-e2e-test/1414766955063152640/artifacts/clusters/bootstrap/controllers/capo-controller-manager/capo-controller-manager-6f4cdd5947-fjlxx/manager.log

First, the floating IP is created:

I0713 02:49:33.906406 1 recorder.go:104] controller-runtime/manager/events "msg"="Normal" "message"="Created floating IP 172.24.4.66 with id 6c0753d4-b5a8-43e0-9ea5-c7cb1373ecb6" "object"={"kind":"OpenStackCluster","namespace":"e2e-ae45wj","name":"cluster-e2e-ae45wj","uid":"cf4ef561-17c1-4818-9592-9e442fbed6e3","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha4","resourceVersion":"5528"} "reason"="SuccessfulCreateFloatingIP"

Then I am still seeing these errors (repeated multiple times):

I0713 02:50:07.053860 1 openstackmachine_controller.go:371] controller-runtime/manager/controller/openstackmachine "msg"="Floating IP association failed, will retry." "cluster"="cluster-e2e-ae45wj" "machine"="cluster-e2e-ae45wj-control-plane-b9qn6" "name"="cluster-e2e-ae45wj-control-plane-lhsdm" "namespace"="e2e-ae45wj" "openStackCluster"="cluster-e2e-ae45wj" "openStackMachine"="cluster-e2e-ae45wj-control-plane-lhsdm" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" "floating-ip"="172.24.4.66" "instance-id"="be74324c-e2cf-4d93-ad80-6c8df076ce9a"

So apparently it's not an issue of the floating IP not being created in time; it might be some other concurrency issue.
I need to read the logs further.

@jichenjc
Contributor Author

I'm not able to find the OpenStack logs, so I'm not sure what happened (why we keep getting the NotFound error).
I will add more debug logs.

@tobiasgiese
Member

This is what I have seen in many of my PR e2e tests:

[2] • Failure [1507.447 seconds]
[2] e2e tests
[2] /home/prow/go/src/sigs.k8s.io/cluster-api-provider-openstack/test/e2e/suites/e2e/e2e_test.go:49
[2]   Workload cluster (without lb)
[2]   /home/prow/go/src/sigs.k8s.io/cluster-api-provider-openstack/test/e2e/suites/e2e/e2e_test.go:133
[2]     Should create port(s) with custom options [It]
[2]     /home/prow/go/src/sigs.k8s.io/cluster-api-provider-openstack/test/e2e/suites/e2e/e2e_test.go:134
[2] 
[2]     Expected
[2]         <int>: 0
[2]     to equal
[2]         <int>: 1
[2] 
[2]     /home/prow/go/src/sigs.k8s.io/cluster-api-provider-openstack/test/e2e/suites/e2e/e2e_test.go:171

xref: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/932/pull-cluster-api-provider-openstack-e2e-test/1414482588461961216

@macaptain
Contributor

The Ginkgo tests run in parallel:

time $(GINKGO) -trace -progress -v -tags=e2e --nodes=2 $(E2E_GINKGO_ARGS) ./test/e2e/suites/e2e/... -- -config-path="$(E2E_CONF_PATH)" -artifacts-folder="$(ARTIFACTS)" --data-folder="$(E2E_DATA_DIR)" $(E2E_ARGS)

Setting -nodes to anything greater than 1 implies a parallelized test run.

Is it possible that the two specs in the "Workload cluster (without lb)" description are running at the same time, and one of them is deleting the cluster while the other is trying to create ports on it?
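
As a minimal illustration of that concern (a sketch, not the real suite layout): with --nodes=2 the two specs below may be scheduled onto different Ginkgo processes and run concurrently, so one spec tearing down the cluster could race with the other listing its ports.

package e2e_test

import (
	. "github.com/onsi/ginkgo"
)

// Sketch only: two specs that share the same Describe container.
var _ = Describe("Workload cluster (without lb)", func() {
	It("should be creatable and deletable", func() {
		// ...creates a workload cluster and deletes it at the end of the spec...
	})

	It("Should create port(s) with custom options", func() {
		// ...assumes the cluster/machine still exists while listing its ports...
	})
})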

@iamemilio
Contributor

@jichenjc
Contributor Author

The Ginkgo tests run in parallel:

time $(GINKGO) -trace -progress -v -tags=e2e --nodes=2 $(E2E_GINKGO_ARGS) ./test/e2e/suites/e2e/... -- -config-path="$(E2E_CONF_PATH)" -artifacts-folder="$(ARTIFACTS)" --data-folder="$(E2E_DATA_DIR)" $(E2E_ARGS)

Setting -nodes to anything greater than 1 implies a parallelized test run.

Is it possible that the two specs in the "Workload cluster (without lb)" description are running at the same time, and one of them is deleting the cluster while the other is trying to create ports on it?

Thanks for the info. Since the failure ratio rose suddenly and we don't delete/create ports that frequently, I'd guess this is less likely, but I will take it into consideration. Thanks.

@jichenjc
Contributor Author

I think that Should create port(s) with custom options [It] is also flaking.

See: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/921/pull-cluster-api-provider-openstack-e2e-test/1414589415367380992

Yes, and it's less flaky than the other one; it still needs to be checked...

@macaptain
Contributor

I noticed the ports e2e test creates a machine and then waits for an error condition not to occur before continuing. But it's possible that the error condition didn't occur because the machine wasn't created yet, and then the tests continue anyway. This may be contributing to the flakiness. I raised #938 to hopefully wait for the machine to be created before trying to list the ports.
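
A minimal sketch of that kind of wait, meant to sit inside the spec before the port assertions; getMachine, ctx, namespace, and machineName are hypothetical placeholders, not the code from #938.

// Sketch only: block until the (hypothetical) machine lookup stops returning an error.
Eventually(func() error {
	_, err := getMachine(ctx, namespace, machineName)
	return err
}, 5*time.Minute, 10*time.Second).Should(Succeed(),
	"machine %s was not created in time", machineName)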

@tobiasgiese
Member

tobiasgiese commented Jul 15, 2021

Was it intended to close this issue @hidekazuna?
I think it's because of your description "Because #930 did not fix #927." :)
(i.e., "fix #927" will close the issue)

@jichenjc jichenjc reopened this Jul 15, 2021
@hidekazuna
Contributor

Was it intended to close this issue @hidekazuna?
I think it's because of your description "Because #930 did not fix #927." :)
(i.e., "fix #927" will close the issue)

@tobiasgiese Ah, I definitely did not want to close this issue :)
